Current version: 0.6.0-dev23. See CHANGELOG.md for the granular dev-by-dev history and §17 Versioning Policy for how to bump it.
This document is the authoritative record of everything built in NetCortex, why each decision was made, how things are wired together, and the current operational state. It is written for a developer (or AI agent) picking up the project fresh and being asked to either extend or recreate it.
1. What NetCortex Is
2. High-Level Architecture
3. Project Layout
4. Secret Backend & Bootstrap
5. Graph Data Model
6. Platform Adapters
7. SNMP Adapter: Deep Dive
8. Graph Ingest & Correlation
9. REST API
10. Web UI
11. Worker & Scheduling
12. Docker Deployment
13. Native Worker (macOS)
14. Secrets Schema
15. Known Issues & Workarounds
16. Current Graph State
17. Versioning Policy
18. Recent Major Changes (since 0.1.0)
19. Operational Data Quality (the dev17 → dev20 framework)
20. Current Sprint State (dev23)
NetCortex is an intelligence layer that sits alongside NetBox. It connects to multiple network management platforms (Meraki, Catalyst Center, Intersight, Nexus Dashboard, vSphere, and any SNMP-capable device), discovers the actual network state, and stores it as a multi-dimensional graph in Neo4j.
NetBox is the source of truth for intended state. NetCortex reads site/location/serial data from NetBox to enrich graph nodes. NetCortex does not write back to NetBox (read-only consumer).
Neo4j is the operational graph store. It holds observed device adjacencies, STP trees, routing protocol peers, MAC/ARP tables, VLAN memberships, and IP address assignments — all simultaneously queryable in multiple "dimensions."
MCP is the AI interface. An MCP server exposes graph queries to AI agents (Claude, Cursor, etc.) so they can reason across the network without knowing any platform API.
┌──────────────────────────────────────────────────────────────────────┐
│ AI Agents (Claude / Cursor / custom) │
│ ▲ │
│ MCP (stdio / HTTP+SSE) │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ NetCortex Web (FastAPI) │
│ • /api/graph multi-dimensional topology (Cytoscape.js fmt) │
│ • /api/inventory flat device list │
│ • /api/cam correlated MAC/ARP table │
│ • /api/graph/stp STP tree per domain │
│ • /api/graph/routing L3 prefix + routing peer table │
│ • /api/status adapter health + graph stats │
│ • / interactive web UI │
└──────────────────────────────────────────────────────────────────────┘
│
┌────────┴────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Neo4j │ │ Redis │
│ (graph) │ │ (queue) │
└──────────┘ └──────────┘
▲
│ ingest
│
┌──────────────────────────────────────────────────────────────────────┐
│ NetCortex Worker (background) │
│ • Runs each adapter's discover() on a timer │
│ • Merges resulting GraphData into Neo4j │
│ • Runs correlation passes (MAC→device, CDP/LLDP→physical links) │
│ • Runs site correlation (NetBox site lookup by serial) │
└──────────────────────────────────────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Meraki │ │Catalyst │ │Intersight│
│ API │ │Center API│ │ API │
└──────────┘ └──────────┘ └──────────┘
│ │ │
┌──────────┐ ┌──────────┐
│ SNMP v3 │ │ NetBox │
│(devices) │ │ (SoT) │
└──────────┘ └──────────┘
Secret flow:
.env (AWS creds only)
→ AWS Secrets Manager
→ netcortex/core (neo4j, redis, netbox URLs)
→ netcortex/adapters/_index (which adapters are enabled)
→ netcortex/adapters/{type}/{instance} (per-adapter API keys)
→ netcortex/snmp/default (SNMP v3 credentials)
→ netcortex/snmp/device/{name} (per-device SNMP overrides)
netcortex/
├── adapters/
│ ├── base.py PlatformAdapter ABC + PlatformProfile
│ ├── __init__.py adapter registry: load_instances(), get_instances()
│ ├── meraki.py Cisco Meraki Dashboard API
│ ├── catalyst_center.py Cisco Catalyst Center (DNAC)
│ ├── intersight.py Cisco Intersight (UCS/HX/servers)
│ ├── nexus_dashboard.py Cisco Nexus Dashboard (NDFC)
│ ├── vsphere.py VMware vSphere
│ ├── generic_rest.py Schema-mapped generic REST
│ └── snmp.py SNMP v2c/v3 (IF-MIB, BRIDGE-MIB, LLDP, CDP,
│ OSPF, BGP, EIGRP, ipAddrTable, ipv6AddrTable)
├── graph/
│ ├── models.py GraphNode, GraphEdge, GraphData Pydantic models
│ │ NodeType + EdgeType enums
│ ├── ingest.py MERGE nodes / replace edges in Neo4j
│ ├── query.py Named Cypher queries (graph, inventory, STP,
│ │ routing, CAM, path-finding, stats)
│ ├── correlate.py Cross-adapter physical link correlation
│ ├── site_correlate.py NetBox serial→site lookup & compound nodes
│ ├── client.py Neo4j async driver init
│ └── schema.py Uniqueness constraints
├── snmp/
│ └── credentials.py SnmpCredentialResolver, SnmpContext enum
│ SnmpV3Creds / SnmpV2Creds models
├── models/
│ ├── device.py NormalizedDevice
│ ├── interface.py NormalizedInterface
│ ├── vlan.py NormalizedVLAN
│ └── topology.py NormalizedTopologyLink
├── status/
│ ├── router.py /api/status FastAPI router
│ └── templates/index.html Single-page web UI (Tailwind + Cytoscape.js)
├── config.py Settings (NetBox URL, Neo4j URI, Redis URL …)
├── secrets.py SecretBackend factory (AWS SM / Vault)
├── state.py In-process AppState (adapter health, graph counts)
├── main.py FastAPI app + all API endpoints
├── worker.py Background discovery loop
└── netbox.py pynetbox connectivity check
docs/
├── architecture.md Original design reference (partially superseded)
├── graph.md Graph-centric design reference
├── graph-topology.md Multi-layer topology model spec
├── implementation-journal.md ← THIS FILE — authoritative current state
├── secrets.md Secret path schema + IAM/Vault policies
├── adapters.md Adapter development guide
├── access-layer.md CLI/RESTCONF/NETCONF access layer spec
├── mcp-tools.md MCP tool reference
├── netbox-integration.md NetBox field mapping
├── status-page.md Status page spec
└── sync-engine.md Sync/diff engine spec
docker-compose.yml
Dockerfile
pyproject.toml
run_worker.sh Native macOS worker launcher (bypasses Docker NAT)
| Backend | Env vars required |
|---|---|
| AWS Secrets Manager | SECRET_BACKEND=aws_sm, AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY |
| HashiCorp Vault | SECRET_BACKEND=vault, VAULT_ADDR, VAULT_TOKEN (or AppRole) |
All env vars live in .env (Docker reads it via env_file). The .env file must never be committed.
1. netcortex.config.init_settings() reads SECRET_BACKEND from env.
2. The appropriate SecretBackend is constructed.
3. netcortex/core is fetched: neo4j_uri, neo4j_user, neo4j_password, redis_url, netbox_url, netbox_token, netbox_verify_ssl, sync_interval.
4. netcortex/adapters/_index is fetched: list of {type, name, enabled} dicts.
5. For each enabled adapter, netcortex/adapters/{type}/{name} is fetched and the adapter is instantiated.
netcortex/core.sync_interval (global default, seconds)
netcortex/adapters/{type}.sync_interval (per-adapter-type override)
netcortex/adapters/{type}/{instance}.sync_interval (per-instance override)
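The override chain above reads as "most specific wins". A minimal sketch of that lookup, assuming each secret has already been fetched as a plain dict, that the key is named sync_interval at every level, and that a missing key means "inherit from the next level up". The function name, its arguments, and the fallback constant are hypothetical and not part of NetCortex.

```python
from typing import Any, Mapping, Optional

FALLBACK_INTERVAL_S = 300  # assumption: used only when no level sets a value


def resolve_sync_interval(
    core: Mapping[str, Any],
    adapter_type: Optional[Mapping[str, Any]] = None,
    instance: Optional[Mapping[str, Any]] = None,
) -> int:
    """Return the most specific sync_interval (in seconds) that is set."""
    # Check the per-instance secret first, then the per-type secret,
    # then the global netcortex/core secret.
    for secret in (instance, adapter_type, core):
        if secret and secret.get("sync_interval") is not None:
            return int(secret["sync_interval"])
    return FALLBACK_INTERVAL_S
```

For example, a core value of 300 combined with a per-type value of 600 resolves to 600 for every instance of that type that does not set its own value.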
Each node and edge carries a dimension tag that routes it to the correct visual layer in the UI.
| Dimension | What it represents |
|---|---|
| physical | Physical cables, chassis, interfaces |
| logical | VLANs, SVIs, VRF memberships |
| routing | IP addresses, prefixes, OSPF/BGP/EIGRP peers |
| stp | Spanning-tree domains, port states/roles |
| fabric | EVPN VNIs, VXLAN overlays, fabric peers |
| sdwan | SD-WAN tunnels, policies |
| virtual | VMs, virtual networks (vSphere) |
| Label | What it is | Key properties |
|---|---|---|
| Device | Physical or virtual network device | name, role, platform, serial, mgmt_ip, model, os_version, snmp_polled, stub, status, status_history, status_changed_at, meraki_last_reported_at (ms), meraki_last_reported_at_iso (raw) |
| Interface | Network interface/port | name, device_id, mac, oper_status, speed |
| VLAN | 802.1Q VLAN | vid, name, source |
| VRF | VRF/routing instance | name |
| Prefix | IP subnet (CIDR) | cidr, version (4 or 6), scope (vlan/vlan6/svi/svi6/static), kind (vlan_subnet/static_route/transit/wan), vlan_id, network_id, device_serial, next_hop |
| IPAddress | Assigned IP on an interface | address, version, subnet, device |
| MACAddress | Ethernet MAC address | mac, vendor, ip, vlan, source |
| ARPEntry | ARP/NDP binding | ip, mac, device, source |
| STPDomain | One STP instance (VLAN or MST) | root_bridge_mac, bridge_protocol, vlan |
| RoutingPeer | External routing peer (BGP/OSPF neighbor not in graph as Device) | name, protocol, peer_ip, router_id, remote_as, stub |
| PlatformSite | Platform-specific container (Meraki network, CATC site) | name, platform |
| Site | Canonical NetBox site | name, slug |
| Location | NetBox hierarchical location under a site | name |
| AutonomousSystem | External BGP AS (correlator-built) | asn, name, is_home, dimensions=['wan'] |
| Internet | Singleton public-Internet node (correlator-built) | id='internet:0', dimensions=['wan'] |
Source-of-truth timestamps on Device. The two meraki_last_reported_at* properties are populated by MerakiAdapter.discover() from the lastReportedAt field of getOrganizationApplianceUplinkStatuses. They power the dev19 staleness policy (see §19) — every top_problems device_down and link_down problem consults the A-side device's meraki_last_reported_at and demotes / filters the problem when the dashboard has not refreshed in top_problems_stale_after_seconds.
Status-history scalars on Device. status, status_history (JSON timeline ≤200 events, 7-day window), status_changed_at, plus four flap-stat scalars (status_flap_count_1h/_24h, status_flap_score_1h, status_flap_state) — see §19 for the universal field convention shared with transit-edge oper_status fields.
| Relationship | Meaning | Dimension |
|---|---|---|
| PHYSICAL_LINK | Cable between two devices (LLDP/CDP/API) | physical |
| HAS_INTERFACE | Device owns an interface | physical |
| LOCATED_AT | Device/Interface → PlatformSite or Location | physical |
| WITHIN_LOCATION | Location → parent Location or canonical Site | structural |
| MAPS_TO_SITE | PlatformSite → canonical Site (NetBox) | structural |
| LOGICAL_MEMBER | Interface carries a VLAN | logical |
| HAS_SVI | Device has SVI for a VLAN | logical |
| ASSIGNED_IP | Interface → IPAddress | routing |
| ROUTES_TO | Device → Prefix (from ipAddrTable/ipv6AddrTable) | routing |
| ROUTING_PEER | Device–peer L3 neighbor; protocol=ospf/bgp/eigrp | routing |
| BGP_PEER | BGP session (legacy; superseded by ROUTING_PEER) | routing |
| VRF_MEMBER | Interface/device belongs to VRF | routing |
| LEARNED_MAC | Interface learned a MAC (CAM table entry) | physical |
| OWNS_MAC | Device owns a MAC (NIC) | physical |
| HAS_ARP | Interface or MACAddress → ARPEntry (IP↔MAC) | physical |
| STP_MEMBER | Device participates in STP domain | stp |
| STP_ROOT | Device is root bridge for STP domain | stp |
| STP_LINK | Interface → STPDomain with port_state/port_role | stp |
| VNI_EXTENDS | VNI maps to VLAN | fabric |
| FABRIC_PEER | VTEP-to-VTEP relationship | fabric |
| VNI_MEMBER | Device participates in VNI | fabric |
| HAS_VM | Host → VM | virtual |
| VM_NETWORK | VM → virtual network/port group | virtual |
| SDWAN_TUNNEL | SD-WAN tunnel | sdwan |
| POLICY_APPLIES | SD-WAN policy → device | sdwan |
| WAN_UPLINK | Device → Internet (mx_uplink) or Device → AutonomousSystem (ebgp); correlator-built | wan |
| TRANSITS | AutonomousSystem → Internet (correlator-built) | wan |
Transit-edge operational properties (universal contract). Every edge in {PHYSICAL_LINK, WAN_UPLINK, SDWAN_TUNNEL, ROUTING_PEER} carries the same status-history schema — see §19 for the field list (oper_status + history + flap stats). Plus type-specific properties:
| Edge | Type-specific properties |
|---|---|
| PHYSICAL_LINK | interface_a, interface_b, interface_a_raw, interface_b_raw, discovery_proto, media_type, speed_mbps, speed_bps, health_score, util_pct, error_rate_per_s, l3_prefix_v4[], l3_prefix_v6[] |
| WAN_UPLINK | via (mx_uplink \| ebgp), wan_slot (wan1 \| wan2 for mx_uplink), public_ip, private_ip, asn, peer_ip (for ebgp), health_score, util_pct |
| SDWAN_TUNNEL | vpn_mode (hub \| spoke), reachability (raw Meraki value: reachable/unreachable/unknown), tunnel_type (meraki_autovpn) |
| ROUTING_PEER | protocol, address_family, local_ip, remote_ip, local_as, remote_as, state, router_id, peer_node_id |
SDWAN_TUNNEL.oper_status (0.6.0-dev20). Derived from reachability via the adapter-level mapping _reachability_to_oper_status (netcortex/adapters/meraki.py):
reachable → up
unreachable → down
unknown / other / missing → None (oper_status not set)
The None case is intentional — _update_status_history filters WHERE r.oper_status IS NOT NULL, so tunnels the dashboard has no opinion on don't appear in the transition log as fake "unknown" state changes.
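The mapping can be sketched as a small lookup that returns None for anything the dashboard does not classify. This is an illustration only: the function name below is hypothetical, and the case-insensitive comparison is an assumption, not confirmed behavior of _reachability_to_oper_status.

```python
from typing import Optional

# Only these two raw values translate into an operational state.
_REACHABILITY_MAP = {"reachable": "up", "unreachable": "down"}


def map_reachability(reachability: Optional[str]) -> Optional[str]:
    """Translate a raw Meraki reachability value into oper_status.

    Returns None for "unknown", unexpected, or missing values so the
    caller can leave oper_status unset.
    """
    if not isinstance(reachability, str):
        return None
    # Assumption: tolerate stray whitespace and mixed case in the raw value.
    return _REACHABILITY_MAP.get(reachability.strip().lower())
```

Because None is returned rather than a placeholder string, a filter such as WHERE r.oper_status IS NOT NULL skips these tunnels without any special casing.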
Site (NetBox canonical)
└── Location (optional, hierarchical)
└── PlatformSite (Meraki network, CATC site, etc.)
└── Device
This is expressed via the parent field on Cytoscape.js nodes, not as graph edges, so they render as nested compound containers.
Nodes with stub=True are placeholders created by SNMP discovery (LLDP/CDP neighbors, routing peers) that have not been verified as real devices. They are:
- Excluded from GET /api/inventory
- Visible in the topology graph (they contribute edges)
- Eligible for merging with real Device nodes by the correlator
Every adapter implements PlatformAdapter (netcortex/adapters/base.py):
class PlatformAdapter(ABC):
    name: str                 # e.g. "meraki"
    display_name: str         # e.g. "Cisco Meraki"
    instance_name: str        # e.g. "CPN"
    instance_id: str          # e.g. "meraki/CPN" (name/instance_name)
    profile: PlatformProfile  # capabilities declaration

    async def authenticate(self) -> None: ...
    async def discover(self) -> GraphData: ...
    async def health_check(self) -> dict: ...

discover() returns a GraphData object (lists of GraphNode and GraphEdge). The worker calls discover() on every adapter, then calls ingest_graph_data() to upsert into Neo4j.
Adapter instances are loaded from netcortex/adapters/_index in the secret backend:
[
{"type": "meraki", "name": "CPN", "enabled": true},
{"type": "meraki", "name": "CPNGOV", "enabled": true},
{"type": "catalyst_center", "name": "cpn-ful-catc1", "enabled": true},
{"type": "nexus_dashboard", "name": "cpn-ful-nd1", "enabled": true},
{"type": "intersight", "name": "CPN", "enabled": true},
{"type": "snmp", "name": "default", "enabled": true}
]

Multiple instances of the same type are fully supported (e.g., two Meraki orgs, two Catalyst Centers).
- Authenticates via API key in netcortex/adapters/meraki/{name} (api_key, org_id, base_url)
- Discovers: devices, networks, VLANs, clients (MAC/IP), LLDP adjacencies, STP per-port state, SD-WAN hub topology
- Produces: Device nodes grouped under PlatformSite (Meraki network), PHYSICAL_LINK edges from LLDP, LOGICAL_MEMBER for VLANs, STPDomain nodes with STP_LINK and STP_ROOT edges for spanning tree, SDWAN_TUNNEL for hub-spoke
- Two separate instances (CPN and CPNGOV) with different base URLs (api.meraki.com vs api.gov.meraki.com) and verify_ssl=false for gov
- SNMP polling is layered on top at cloud level (separate SNMP session to Meraki dashboard endpoint on custom port)
- Authenticates via username/password → JWT token
- Discovers: devices (inventory), interfaces, VLANs, topology links, MAC address tables (via CLI command runner), LLDP neighbors
- Produces: Device nodes with OS version, status; PHYSICAL_LINK from topology API; LOGICAL_MEMBER for VLANs; MACAddress + LEARNED_MAC from CAM tables
- Hostname deduplication: cpn-ash-cat8k1.ciscops.net and cpn-ash-cat8k1 are the same device; this is resolved by serial number match during correlation
- Authenticates via API key ID + RSA private key (request signing, stored in netcortex/adapters/intersight/{name})
- Discovers: compute blades, rack units, HyperFlex clusters, server profiles, fabric interconnects (FIs), vNIC/NIC inventory
- Produces: Device nodes for servers and FIs; PHYSICAL_LINK edges from FI port → server vNIC associations; HAS_INTERFACE edges; LOGICAL_MEMBER for vNIC VLANs
- Authenticates via username/password → session token
- Discovers: fabric sites, VLANs, VNIs, VTEP peers, MAC tables from NDFC
- Produces: Device nodes, VLAN nodes, VNI nodes, FABRIC_PEER edges, VNI_EXTENDS, LEARNED_MAC
- Authenticates via vCenter REST API (username/password)
- Discovers: hosts, VMs, port groups, datastores
- Produces: Device nodes for ESXi hosts, HAS_VM edges to VM nodes, VM_NETWORK edges to virtual networks
The SNMP adapter is the most complex component. It provides a protocol-agnostic fallback for any device reachable via SNMP, and enriches data from other adapters with protocol-level detail (STP state, routing peers, MAC/ARP tables, IP addresses).
- No static device list required. Targets are read from Neo4j: any Device node with mgmt_ip set is polled.
- Hierarchical credential resolution. Per-device → per-adapter-type → global default (all from AWS Secrets Manager/Vault).
- Parallel polling. Up to max_concurrent (default 20) devices polled simultaneously via asyncio.Semaphore.
- Hard timeouts. Per-walk 90s, per-device 300s, Neo4j write 30s. These prevent any single device from blocking the cycle.
- No stub pollution. LLDP/CDP neighbor names are validated before creating nodes. Garbage names (binary data, pure integers, < 3 chars) are silently dropped.
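The parallel-polling and hard-timeout principles can be combined in a few lines of asyncio. The following is a sketch under stated assumptions, not the adapter's actual code: poll_all and poll_device are hypothetical names, and the per-device poll is passed in as any coroutine function that takes a target name.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def poll_all(
    targets: Iterable[str],
    poll_device: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 20,
    device_timeout: float = 300.0,
) -> Dict[str, object]:
    """Poll every target concurrently; map each target to its result or exception."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(target: str) -> object:
        async with sem:  # at most max_concurrent polls are in flight
            try:
                # Hard cap per device; raises asyncio.TimeoutError on expiry.
                return await asyncio.wait_for(poll_device(target), device_timeout)
            except Exception as exc:
                # Return the error instead of raising so that one bad
                # device does not abort the rest of the cycle.
                return exc

    names = list(targets)
    results = await asyncio.gather(*(guarded(name) for name in names))
    return dict(zip(names, results))
```

A per-walk timeout would follow the same pattern one level down, with asyncio.wait_for wrapped around each individual MIB walk.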
netcortex/snmp/device/{device_name} → per-device override (highest priority)
netcortex/snmp/adapter/{adapter_type} → per-platform-type override
netcortex/snmp/default → global fallback
Each secret contains: username, auth_password, priv_password, auth_protocol (SHA/SHA256/MD5), priv_protocol (AES128/AES256/DES), security_level (authPriv/authNoPriv/noAuthNoPriv).
Meraki has two SNMP planes with different capabilities:
| Plane | Endpoint | Supported priv | What it sees |
|---|---|---|---|
| Cloud | snmp.meraki.com:port (from Dashboard API) | AES | Org-wide: all devices, VLANs, STP |
| Device | Management IP:161 | DES only | Per-device: IF-MIB, STP ports |
The SnmpContext enum (CLOUD vs DEVICE) controls which credential set and which OIDs are used. The SnmpCredentialResolver enforces DES for device-level polls on Meraki regardless of the credential secret contents.
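A minimal sketch of the resolution order together with the Meraki DES override, assuming secrets are fetched through a callable that returns a dict or None. The SnmpContext class below is a stand-in with hypothetical member values, and resolve_snmp_creds is an illustrative name rather than the real SnmpCredentialResolver API.

```python
from enum import Enum
from typing import Callable, Mapping, Optional

Secret = Mapping[str, str]


class SnmpContext(Enum):  # stand-in for the real enum; values are assumed
    CLOUD = "cloud"
    DEVICE = "device"


def resolve_snmp_creds(
    fetch: Callable[[str], Optional[Secret]],
    device_name: str,
    adapter_type: str,
    context: SnmpContext,
) -> dict:
    """Return the most specific credential set for one device."""
    paths = (
        f"netcortex/snmp/device/{device_name}",    # highest priority
        f"netcortex/snmp/adapter/{adapter_type}",  # per-platform override
        "netcortex/snmp/default",                  # global fallback
    )
    for path in paths:
        secret = fetch(path)
        if secret:
            creds = dict(secret)  # copy so the stored secret is not mutated
            break
    else:
        raise LookupError(f"no SNMP credentials found for {device_name}")
    # Meraki device-level polls only support DES, whatever the secret says.
    if adapter_type == "meraki" and context is SnmpContext.DEVICE:
        creds["priv_protocol"] = "DES"
    return creds
```

Passing a dict's get method as fetch is enough to exercise the logic without a secret backend.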
| Phase | MIBs | Data produced |
|---|---|---|
| 1 | SNMPv2-MIB, IF-MIB | sysDescr, sysUpTime, ifName, ifAlias, ifPhysAddress, ifOperStatus, ifSpeed |
| 2 | BRIDGE-MIB (CAM) | dot1dTpFdb → MACAddress + LEARNED_MAC edges |
| 2 | IP-MIB (ARP) | ipNetToMediaTable → ARPEntry + HAS_ARP edges |
| 3 | BRIDGE-MIB (STP) | dot1dStp scalars + port table → STPDomain + STP_MEMBER/ROOT/LINK |
| 3 | RSTP-MIB | port roles (backup/alternate/root/designated) |
| 4 | LLDP-MIB | lldpRemSysName/PortId/PortDesc → PHYSICAL_LINK stubs |
| 4 | CISCO-CDP-MIB | cdpCacheDeviceId/Port → PHYSICAL_LINK stubs |
| 5 | OSPF-MIB | ospfNbrTable → ROUTING_PEER edges (protocol=ospf) |
| 5 | BGP4-MIB | bgpPeerTable → ROUTING_PEER edges (protocol=bgp) |
| 5 | CISCO-EIGRP-MIB | cEigrpNbrTable → ROUTING_PEER edges (protocol=eigrp) |
| 6 | ipAddrTable (RFC 1213) | IPv4 addresses → IPAddress + ASSIGNED_IP + Prefix + ROUTES_TO |
| 6 | ipv6AddrTable (RFC 2465) | IPv6 addresses → same as above with version=6 |
A key source of bugs was pysnmp returning raw pyasn1 objects whose str() representation is binary garbage for OctetString fields. Three helpers were added:
_decode_display_str(val) # DisplayString/OctetString → clean UTF-8, strips non-printable
_decode_ip_val(val) # IpAddress → dotted-decimal; handles decimal integers too
_is_valid_neighbor_name(name) # Returns True only for plausible hostnames

_decode_ip_val is critical for OSPF router IDs: some devices return the 32-bit router ID as a decimal integer (e.g., 1444263578). The function converts this via struct.pack("!I", int(s)) to 86.21.x.x.
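The integer-to-address conversion can be illustrated with the standard library alone. This is a sketch of the technique, not the body of _decode_ip_val: the function name is hypothetical, and it assumes the value has already been reduced to either a decimal integer or a dotted-decimal string.

```python
import ipaddress
import struct


def decode_ip_value(raw: object) -> str:
    """Return dotted-decimal IPv4 for an integer or dotted-decimal input."""
    text = str(raw).strip()
    if text.isdigit():
        # A bare 32-bit integer: pack it in network byte order, then let
        # ipaddress render the four bytes as a.b.c.d.
        packed = struct.pack("!I", int(text))
        return str(ipaddress.IPv4Address(packed))
    # Already dotted-decimal: validate and normalize it.
    return str(ipaddress.IPv4Address(text))
```

For instance, decode_ip_value("167772161") returns "10.0.0.1", and an input that is neither form raises a ValueError from the ipaddress module.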
After each poll cycle, _write_snmp_coverage() writes snmp_polled=True/False and snmp_polled_at=<timestamp> to each Device node in Neo4j. This is read by the status page to show ✓ catalyst_center/cpn-ful-catc1: 2/5 (2 devices polled of 5 targets).
- O(N²) problem resolved. Early versions used any(n.id == x for n in data.nodes) to deduplicate nodes, which is O(N²) when N is thousands of LLDP entries. Replaced everywhere with seen: set[str].
- Data caps. LLDP: max_neighbors=500. Routing peers: max_peers=200. Prevents internet-facing border routers from generating thousands of nodes.
- Walk timeout. Each MIB walk has a 90s timeout via asyncio.wait_for. Critical for devices with large or slow ifName tables.
- Device timeout. Each device poll has a 300s hard cap, so a single unresponsive device cannot block the cycle for more than 5 minutes.
ingest_graph_data(GraphData) →
1. Pre-compute content hashes for every node + edge (sha1 of canonical JSON)
2. Canonicalize undirected edges so source_id ≤ target_id (swap iface props)
3. Look up existing node/edge hashes from Neo4j
4. Purge stale edges for each rel_type owned by this adapter
5. MERGE nodes by id; skip rows whose stored _content_hash already matches
6. MERGE only changed edges; touch-only unchanged edges (`last_seen`) by key
7. MERGE edges by (src, dst, rel) — plus interface_a/interface_b for
multi-edge types (PHYSICAL_LINK) so parallel cables survive
Edge purge is scoped per (rel_type, source_adapter) — adapters do not accidentally delete each other's data.
Node MERGE uses id as the stable key. Properties are overwritten on each cycle. The stub flag must be explicitly set false by a real adapter to "promote" a stub node to a real device.
Multi-edge PHYSICAL_LINK schema (since 0.2.0). The relationship key
includes interface_a and interface_b (empty string instead of NULL)
so a switch with three cables to the same neighbor produces three
distinct Neo4j relationships instead of collapsing onto one.
_MULTI_EDGE_REL_TYPES controls which rel types behave this way —
currently only PHYSICAL_LINK. Content hashing follows the same
identity (_edge_identity()) so the hash table also keys per cable.
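The hashing, canonicalization, and identity rules can be sketched on plain dicts. This is an illustration under assumptions: the edge field names (source_id, target_id, rel_type) follow the steps listed earlier, the function names are hypothetical, and the real code operates on GraphEdge models rather than dicts.

```python
import hashlib
import json
from typing import Any, Dict, Tuple

MULTI_EDGE_REL_TYPES = {"PHYSICAL_LINK"}  # mirrors _MULTI_EDGE_REL_TYPES


def content_hash(props: Dict[str, Any]) -> str:
    """sha1 over canonical JSON (sorted keys, compact separators)."""
    canonical = json.dumps(props, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def canonicalize_edge(edge: Dict[str, Any]) -> Dict[str, Any]:
    """Order an undirected edge so source_id <= target_id, swapping interfaces."""
    out = dict(edge)
    if out["source_id"] > out["target_id"]:
        out["source_id"], out["target_id"] = out["target_id"], out["source_id"]
        out["interface_a"], out["interface_b"] = out.get("interface_b"), out.get("interface_a")
    return out


def edge_identity(edge: Dict[str, Any]) -> Tuple[str, ...]:
    """MERGE key: (src, dst, rel), plus the interface pair for multi-edge types."""
    key = (edge["source_id"], edge["target_id"], edge["rel_type"])
    if edge["rel_type"] in MULTI_EDGE_REL_TYPES:
        # Empty string instead of None keeps the key comparable.
        key += (edge.get("interface_a") or "", edge.get("interface_b") or "")
    return key
```

With this shape, the same cable reported from either end produces one identity and one hash, while a second cable on different ports produces a different identity.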
Runs after all adapters complete in this strict order:
1. _merge_neighbor_stubs_by_name(): LLDP/CDP stub Devices are re-keyed to a real Device with the same hostname (case-insensitive, first DNS label). Inbound and outbound PHYSICAL_LINK edges are redirected with the interface pair preserved, then the stub is DETACH DELETE-ed. A second pass collapses stub-to-stub groups (e.g. lldp-neighbor:foo and cdp-neighbor:foo) into a single canonical stub when no real device matches.
2. _correlate_via_mac(): Inserts a PHYSICAL_LINK edge tagged source='correlated', discovery_proto='mac_correlation' whenever a switch port's LEARNED_MAC matches a device's OWNS_MAC. Skips any pair that already has an LLDP/CDP/native-topology edge in either direction.
3. _correlate_via_arp(): Same shape as MAC correlation but uses ARP entries on a switch interface that resolve to another device's assigned IP. Skips any pair already covered by LLDP/CDP/native or MAC correlation (ARP is the weakest signal).
4. _dedupe_physical_links_by_pair(): Three-rule policy:
   - Group all PHYSICAL_LINK edges by undirected pair (a, b) with a.id < b.id.
   - If any LLDP/CDP/native-topology edge exists for the pair, delete every mac_correlation/arp_correlation edge for that pair.
   - Sub-group the remaining edges by the canonical interface pair tuple(sorted((iface_a, iface_b))) so parallel cables on distinct ports survive, then keep the highest-priority edge per sub-group (priority table in _PROTO_PRIORITY).
5. _normalize_physical_link_interfaces(): Rewrites stored interface_a/interface_b through normalize_ifname() (Vl80→Vlan80, Twe1/1/5→TwentyFiveGigE1/1/5) as a safety net for legacy edges or adapters that bypassed normalization at creation.
6. _enrich_physical_links_with_health(): Copies per-interface util/error/health metrics onto the PHYSICAL_LINK edge.
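The three-rule dedupe policy can be sketched in memory to show how the rules interact. This is not the correlator's code, which works against Neo4j: the edge dict keys (a, b, interface_a, interface_b, discovery_proto), the function name, and the priority numbers are all assumptions for illustration, and the real priorities live in _PROTO_PRIORITY.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical priorities; higher wins. The real table is _PROTO_PRIORITY.
PROTO_PRIORITY = {"native": 5, "lldp": 4, "cdp": 3, "mac_correlation": 2, "arp_correlation": 1}
CORRELATED = {"mac_correlation", "arp_correlation"}


def dedupe_links(edges: List[dict]) -> List[dict]:
    """Apply the three rules to a list of PHYSICAL_LINK edge dicts."""
    # Rule 1: group by undirected device pair.
    by_pair: Dict[tuple, List[dict]] = defaultdict(list)
    for edge in edges:
        by_pair[tuple(sorted((edge["a"], edge["b"])))].append(edge)

    kept: List[dict] = []
    for group in by_pair.values():
        # Rule 2: any discovered edge evicts all correlated edges for the pair.
        if any(e["discovery_proto"] not in CORRELATED for e in group):
            group = [e for e in group if e["discovery_proto"] not in CORRELATED]
        # Rule 3: sub-group by canonical interface pair, keep the best of each.
        by_ifaces: Dict[tuple, List[dict]] = defaultdict(list)
        for e in group:
            key = tuple(sorted((e.get("interface_a") or "", e.get("interface_b") or "")))
            by_ifaces[key].append(e)
        for candidates in by_ifaces.values():
            kept.append(max(candidates, key=lambda e: PROTO_PRIORITY.get(e["discovery_proto"], 0)))
    return kept
```

Two cables between the same switches on different ports land in different sub-groups and both survive, while an LLDP and a CDP report of the same cable collapse to one edge.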
Runs after correlation. Queries NetBox for each Device's serial number:
- If found in NetBox: uses NetBox's site.name and site.slug to create/reference a canonical Site node and link PlatformSite → Site via MAPS_TO_SITE
- Creates compound node hierarchy: Site → PlatformSite → Device (expressed as Cytoscape.js parent fields)
- Preserves the platform container name in node properties even when the canonical site overrides the visual grouping
The get_full_graph() query builds compound node parentage by:
- Walking MAPS_TO_SITE, WITHIN_LOCATION, LOCATED_AT edges (marked as _STRUCTURAL_RELS)
- Setting data.parent on each child node to the container's id
- Never returning structural edges as Cytoscape edges; they are only used for parentage
All endpoints are in netcortex/main.py.
| Method | Path | Description |
|---|---|---|
| GET | / | Web UI (single HTML page) |
| GET | /health | Docker healthcheck; returns overall status + per-adapter status |
| GET | /api/status | Full adapter health, graph stats, SNMP coverage |
| GET | /api/graph | Topology graph (Cytoscape.js format); params: dimension, site, limit, include_interfaces, include_mac_nodes |
| GET | /api/graph/device/{name} | 2-hop subgraph around one device |
| GET | /api/graph/mac-table | MAC address table (filterable by device/MAC) |
| GET | /api/graph/correlation | Physical link correlation statistics |
| GET | /api/graph/path | Shortest path between two devices (BFS) |
| GET | /api/graph/stats | Node and relationship counts |
| GET | /api/graph/stp | STP topology: domains → root bridges → members → port states |
| GET | /api/graph/routing | L3 routing: prefixes (IPv4+IPv6) + routing peer table |
| GET | /api/graph/vlans | VLAN inventory table rows; optional site / device filters |
| GET | /api/inventory | Flat device list (excludes stub nodes) |
| GET | /api/cam | Correlated MAC/ARP table with vendor, port, owner, IPs |
| POST | /api/adapters/refresh | Re-check all adapter health (background) |
| POST | /api/adapters/sync | Trigger full discovery + ingest cycle (background) |
The topology graph endpoint accepts ?dimension=physical|logical|routing|stp|fabric|sdwan|virtual. The dimension controls which edge types are returned:
_DIMENSION_RELS = {
"physical": [PHYSICAL_LINK, HAS_INTERFACE],
"logical": [LOGICAL_MEMBER, HAS_SVI, ASSIGNED_IP, VRF_MEMBER],
"routing": [ROUTES_TO, BGP_PEER, VRF_MEMBER, ROUTING_PEER, ASSIGNED_IP],
"stp": [STP_MEMBER, STP_ROOT, STP_LINK],
"fabric": [VNI_EXTENDS, FABRIC_PEER, VNI_MEMBER],
"sdwan": [SDWAN_TUNNEL, POLICY_APPLIES],
"virtual": [HAS_VM, VM_NETWORK, LOGICAL_MEMBER, HAS_SVI, VNI_MEMBER],
}

The entire UI is a single Jinja2-rendered HTML file (netcortex/status/templates/index.html). It uses:
- Tailwind CSS (CDN) for styling
- Cytoscape.js + fcose layout for interactive network graphs
- Vanilla JavaScript (no framework) for data loading and rendering
| Tab | Content |
|---|---|
| Topology | Interactive graph; dimension buttons (Physical/Logical/Routing/STP/Fabric/SD-WAN/Virtual); search; layout picker |
| Inventory | Sortable/filterable device table — name, role, model, serial, IP, site, adapter, data sources, OS version, status |
| MAC / ARP Table | Correlated MAC table — MAC, vendor, learned-on device/port, VLAN, owner device, NIC, IPs |
| Spanning Tree | Per-STP-domain cards showing root bridge, member devices (sorted by path cost), and port states/roles |
| Routing | Network prefixes (IPv4+IPv6) with attached devices; routing peer table (OSPF/BGP/EIGRP) |
| VLANs | Filterable VLAN inventory table with member devices, sites, and source/provenance |
- Compound nodes: devices nest inside PlatformSite containers, which nest inside Location/Site containers
- Stable zoom: zoom/pan/dimension state is not reset on background data refresh (only on explicit dimension change)
- Node detail panel: clicking a device opens a side panel showing all properties
- Edge hover: hovering an edge shows a tooltip with interface names, discovery protocol, and other properties
- Color coding: each node type has a fixed color; edge types have semantic colors (red=STP_ROOT, green=ROUTING_PEER, blue=PHYSICAL_LINK, etc.)
Above the tabs, a collapsible table shows each adapter with:
- Status pill (connected / degraded / error)
- SNMP indicator for the snmp/default adapter
- Node/edge count contributed last cycle
- Refresh and Sync buttons
Each device row in Inventory shows colored pills for each data source:
- meraki: data came from the Meraki API
- snmp: device was successfully polled via SNMP
- Additional sources can be added (future: netconf, restconf)
netcortex/worker.py is the background process that:
- Loads all adapter instances (same code path as the web server)
- Runs a periodic loop: for each adapter, call discover() → ingest_graph_data() → correlation passes
- Respects per-adapter sync_interval from the secret backend
- Gates correlation on a full adapter round (one successful discover per configured instance) so correlator passes run on a coherent snapshot rather than a partial cycle
netcortex/core → default_sync_interval (e.g., 300 seconds)
netcortex/adapters/{type} → sync_interval (e.g., snmp: 600)
netcortex/adapters/{type}/{name} → sync_interval (e.g., for a slow platform)
Each adapter runs independently. A failure in one adapter does not block others. Errors are logged with structlog and the adapter status is updated in Neo4j for display in the UI.
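The "full adapter round" gate can be modeled as a small bookkeeping object. This is a sketch of the idea only: the RoundGate class and its method are hypothetical names, and how the real worker tracks rounds is not described here.

```python
from typing import Iterable, Set


class RoundGate:
    """Tracks which adapter instances have completed discover() this round."""

    def __init__(self, instance_ids: Iterable[str]) -> None:
        self._expected: Set[str] = set(instance_ids)
        self._seen: Set[str] = set()

    def mark_success(self, instance_id: str) -> bool:
        """Record a successful discover(); return True once the round is full."""
        if instance_id in self._expected:
            self._seen.add(instance_id)
        if self._seen == self._expected:
            self._seen.clear()  # start counting the next round
            return True
        return False
```

The worker would call mark_success() after each successful ingest and run the correlation passes only when it returns True; a failed adapter simply never marks itself, so correlation waits for its next successful cycle.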
services:
neo4j: # Graph database
redis: # Task queue / coordination
netcortex: # FastAPI web server (uvicorn)
# netcortex-worker: # Disabled on macOS; run natively instead

The worker container is defined in docker-compose.yml but not started by default on macOS because Docker's network isolation prevents it from reaching private management IPs (10.x.x.x, 172.x.x.x) on the corporate network. See section 13.
- neo4j: waits for Bolt port 7687 to accept connections
- redis: uses redis-cli ping
- netcortex: GET http://localhost:8000/health, which returns {"status": "healthy"} when Neo4j is connected
- netcortex-worker: TCP connect to neo4j:7687 (Python one-liner, since redis-cli is not in the image)
docker compose build netcortex
docker compose up -d

The Dockerfile uses a multi-stage build: the build stage installs all Python deps, and the runtime stage runs as the non-root user netcortex.
Docker Desktop on macOS uses a Linux VM. Containers cannot reach private network IPs (e.g., 10.x.x.x device management IPs) without complex VPN routing. The SNMP adapter needs direct UDP:161 access to devices.
Solution: run netcortex.worker as a native macOS process. It connects to the containerized Neo4j and Redis via localhost:7687 and localhost:6379, but can reach any IP the Mac can route to.
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
# Load .env
set -a; source .env; set +a
# Point to Docker-hosted services
export NEO4J_URI="${NEO4J_URI:-bolt://localhost:7687}"
export REDIS_URL="${REDIS_URL:-redis://localhost:6379/0}"
export SYNC_BACKEND="${SYNC_BACKEND:-celery}"
exec /opt/homebrew/Caskroom/miniforge/base/bin/python3 -m netcortex.worker

Usage:
# Install native dependencies (once)
pip install -e ".[all]"
# Start
nohup bash run_worker.sh > /tmp/nc_worker.log 2>&1 &
# Monitor
tail -f /tmp/nc_worker.log
# Stop
pkill -f netcortex.worker

{
"neo4j_uri": "bolt://neo4j:7687",
"neo4j_user": "neo4j",
"neo4j_password": "...",
"redis_url": "redis://redis:6379/0",
"netbox_url": "https://netbox.example.com",
"netbox_token": "...",
"netbox_verify_ssl": false,
"default_sync_interval": 300
}

netbox_verify_ssl defaults to true when omitted. Set it to false for self-signed lab NetBox deployments.
[
{"type": "meraki", "name": "CPN", "enabled": true},
{"type": "meraki", "name": "CPNGOV", "enabled": true},
{"type": "catalyst_center", "name": "cpn-ful-catc1", "enabled": true},
{"type": "nexus_dashboard", "name": "cpn-ful-nd1", "enabled": true},
{"type": "intersight", "name": "CPN", "enabled": true},
{"type": "snmp", "name": "default", "enabled": true}
]

{
"api_key": "...",
"org_id": "686235993220619936",
"base_url": "https://api.meraki.com/api/v1",
"verify_ssl": true
}

{
"api_key": "...",
"org_id": "...",
"base_url": "https://api.gov.meraki.com/api/v1",
"verify_ssl": false
}

{
"base_url": "https://cpn-ful-catc1.ciscops.net",
"username": "...",
"password": "...",
"verify_ssl": true
}

{
"base_url": "https://intersight.com",
"api_key_id": "...",
"secret_key": "-----BEGIN EC PRIVATE KEY-----\n..."
}
{
"username": "netcortex",
"auth_password": "...",
"priv_password": "...",
"auth_protocol": "SHA",
"priv_protocol": "AES128",
"security_level": "authPriv"
}
Same structure as snmp/default. Takes precedence for that specific device.
# Create a secret
aws secretsmanager create-secret \
--name "netcortex/snmp/default" \
--secret-string '{"username": "netcortex", ...}'
# Update a secret
aws secretsmanager put-secret-value \
--secret-id "netcortex/snmp/default" \
--secret-string '{"username": "netcortex", ...}'
# Load key from file (Intersight)
aws secretsmanager put-secret-value \
--secret-id "netcortex/adapters/intersight/CPN" \
--secret-string "$(jq -n \
--arg key_id "$KEY_ID" \
--arg secret "$(cat secret_key.pem)" \
'{api_key_id: $key_id, secret_key: $secret}')"
Symptom: snmp.walk.timeout is logged for cpn-ful-cat8k1 and cpn-ash-cat8k1 after 90 seconds during the ifName walk.
Cause: These devices have a very large interface table (hundreds of tunnel interfaces, subinterfaces, etc.) that takes >90s to walk via bulk SNMP.
Workaround: Increase walk_timeout in the SNMP adapter config, or add a per-device secret to skip certain MIBs. The _SnmpSession class accepts walk_timeout as a parameter.
Not yet done: Per-device MIB exclusion list.
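The timeout behaviour described above can be sketched as follows. This is a minimal illustration, not the actual _SnmpSession code: walk_table is a hypothetical stand-in that sleeps to simulate a slow device, and only the walk_timeout parameter name comes from this journal.

```python
import asyncio
from typing import List, Optional

IF_NAME_OID = "1.3.6.1.2.1.31.1.1.1.1"  # IF-MIB::ifName


async def walk_table(oid: str, simulated_delay: float) -> List[str]:
    """Hypothetical stand-in for a bulk walk; sleeps to mimic a slow device."""
    await asyncio.sleep(simulated_delay)
    return [f"{oid}.1", f"{oid}.2"]


async def bounded_walk(
    oid: str, simulated_delay: float, walk_timeout: float
) -> Optional[List[str]]:
    """Run one walk under a timeout; return None rather than stall the cycle."""
    try:
        return await asyncio.wait_for(
            walk_table(oid, simulated_delay), timeout=walk_timeout
        )
    except asyncio.TimeoutError:
        print(f"snmp.walk.timeout oid={oid} timeout={walk_timeout}s")
        return None


fast = asyncio.run(bounded_walk(IF_NAME_OID, 0.01, walk_timeout=1.0))
slow = asyncio.run(bounded_walk(IF_NAME_OID, 0.5, walk_timeout=0.05))
```

Raising walk_timeout for the two cat8k1 devices trades a longer cycle for a complete ifName table; a per-device exclusion list would avoid the walk entirely.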
LLDP/CDP stub Devices that lose all their relationships are now
garbage-collected by _housekeeping_loop() (see netcortex/worker.py).
The same loop also evicts orphan RoutingPeer, MACAddress,
ARPEntry, IPAddress, and Prefix nodes once they no longer have
any incoming edges.
Meraki device-level SNMP (direct poll on port 161) only supports DES for privacy. This is enforced by SnmpCredentialResolver which overrides priv_protocol=DES for Meraki targets in SnmpContext.DEVICE context. The global snmp/default can use AES128.
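A minimal sketch of the override described above. The SnmpCredentials dataclass, the resolve function, and the CLOUD enum member are assumptions made for illustration; only SnmpContext.DEVICE, priv_protocol, and the DES restriction come from this journal, and the real SnmpCredentialResolver may be structured differently.

```python
from dataclasses import dataclass, replace
from enum import Enum


class SnmpContext(Enum):
    DEVICE = "device"  # direct poll of the device on port 161
    CLOUD = "cloud"    # hypothetical second context, for contrast only


@dataclass(frozen=True)
class SnmpCredentials:
    """Hypothetical subset of the snmp/default secret."""
    username: str
    auth_protocol: str
    priv_protocol: str


def resolve(creds: SnmpCredentials, platform: str, context: SnmpContext) -> SnmpCredentials:
    """Force DES privacy for Meraki device-level polls; leave everything else alone."""
    if platform == "meraki" and context is SnmpContext.DEVICE:
        return replace(creds, priv_protocol="DES")
    return creds


default = SnmpCredentials("netcortex", "SHA", "AES128")
meraki_device = resolve(default, "meraki", SnmpContext.DEVICE)
catalyst_device = resolve(default, "catalyst_center", SnmpContext.DEVICE)
```

Keeping the override in the resolver means the global snmp/default secret can stay on AES128 without a Meraki-specific copy.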
The _poll_ip_addresses() function was added in the most recent cycle. IPv6 addresses will appear after the worker completes its next full SNMP poll cycle. The ipv6AddrTable (OID 1.3.6.1.2.1.55.1.8) is queried for all SNMP-responsive devices.
Meraki STP data is collected via the Dashboard REST API (per-port state, root bridge election result). The API does not return the root bridge MAC directly. The root_bridge_mac field on STPDomain nodes from Meraki is therefore NULL; the root bridge is identified by the STP_ROOT edge instead.
As of 0.6.0-dev20 against the live development graph:
| Node type | Count | Primary sources |
|---|---|---|
| Device | ~354 | Meraki (~290), CATC (~5), Intersight (~50), NDFC (~10) |
| Interface | ~510 | Meraki port-statuses, CATC, Intersight, SNMP |
| Prefix | ~120 | Meraki appliance VLANs + static routes + switch SVIs, SNMP ipAddrTable |
| MACAddress | ~545 | Meraki clients, CATC hosts, NDFC, SNMP CAM |
| ARPEntry | ~227 | Meraki, SNMP, CATC |
| PlatformSite | ~108 | Meraki (networks), CATC (sites), NDFC (fabrics) |
| VLAN | ~106 | Meraki, CATC, NDFC |
| STPDomain | ~52 | Meraki, SNMP |
| AutonomousSystem | small | correlator (external eBGP peers only — home AS dropped in dev3) |
| Internet | 1 | correlator singleton |
| Transit edge type | Count | Note |
|---|---|---|
| PHYSICAL_LINK | ~133 | Meraki topology + LLDP/CDP + SNMP |
| WAN_UPLINK | ~54 | correlator-built; ~47 wan1 + ~7 wan2 slots + 3 ebgp (0.6.0-dev20: wan_slot exposed via links_list slim view) |
| SDWAN_TUNNEL | ~70 | Meraki AutoVPN — 41 up, 29 down (0.6.0-dev20: oper_status now derived from reachability) |
| ROUTING_PEER | ~1,300 | SNMP (OSPF + BGP) |
| Operational signal | Status |
|---|---|
| top_problems critical count | ~30 (active SDWAN_TUNNEL outages dominate; staleness policy demoted dormant MX inventory to info) |
| Status-history coverage | All four transit edge types + Device.status tracked; 70/70 SDWAN_TUNNEL carry oper_status_history |
| Adapter source-of-truth timestamps | meraki_last_reported_at populated on ~290 Meraki Devices |
SNMP coverage: 2/5 Catalyst Center devices (cpn-ful-cat8k2, cpn-ash-cat8k2) are successfully polled. The two cat8k1 devices time out on ifName walk. Meraki cloud endpoint polling adds additional STP and neighbor data.
NetCortex follows Semantic Versioning 2.0. Two files carry the version string and must be kept in lockstep, and every bump needs a matching CHANGELOG.md entry:
- netcortex/__init__.py: __version__ = "x.y.z"
- pyproject.toml: version = "x.y.z"
- CHANGELOG.md: describe what changed
| Bump | Trigger |
|---|---|
| MAJOR | User-declared. Breaking changes or a named product milestone. |
| MINOR | A new feature — new adapter, new view, new MIB, new endpoint, new schema. |
| PATCH | A bug fix — behavior corrected without adding or removing functionality. |
Every commit that changes behavior must add a CHANGELOG.md entry
under the next-pending version section. Bump the appropriate digit at
the same time you commit the change (don't batch bumps).
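The lockstep rule lends itself to a small pre-commit check. The sketch below is an assumption about how such a check could look, not an existing project script; it works on file contents passed in as strings so it can be tried without touching the repository, and the package name in the sample pyproject text is a placeholder.

```python
import re
from typing import Optional


def _find(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


def versions_in_lockstep(init_text: str, pyproject_text: str) -> bool:
    """True when both files declare the same, non-missing version string."""
    init_version = _find(r'^__version__\s*=\s*"([^"]+)"', init_text)
    project_version = _find(r'^version\s*=\s*"([^"]+)"', pyproject_text)
    return init_version is not None and init_version == project_version


init_src = '__version__ = "0.6.0-dev23"\n'
pyproject_src = '[project]\nname = "netcortex"\nversion = "0.6.0-dev23"\n'
drifted_src = '[project]\nname = "netcortex"\nversion = "0.6.0-dev22"\n'

in_sync = versions_in_lockstep(init_src, pyproject_src)
out_of_sync = versions_in_lockstep(init_src, drifted_src)
```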
A snapshot — the canonical record is CHANGELOG.md.
- SNMP v3 harvester rewritten on top of net-snmp/snmpbulkwalk (the pysnmp 7.x version deadlocked under concurrent load).
- Per-adapter and per-instance sync-interval overrides.
- Multi-dimensional graph (physical / logical / routing / STP / fabric / SD-WAN / virtual) with Cytoscape compound parents.
- Stub merger, MAC + ARP correlation, dedupe with discovery-protocol priority, interface-name normalization, health enrichment.
- Per-port spanning-tree, per-VLAN logical membership, IPv4 + IPv6 prefix discovery via ipAddrTable / ipv6AddrTable.
- Data Explorer endpoint + view.
- Inventory data-source pills + per-adapter SNMP coverage.
- Multi-edge PHYSICAL_LINK schema: parallel cables between the same two devices each become a distinct Neo4j edge (was: one collapsed edge that lost per-port detail). This required updates to ingest MERGE, content hashing, stub merger, dedupe, and the housekeeping reverse-edge collapse.
- Fixed Cytoscape edge-id collision for parallel PHYSICAL_LINK edges. get_full_graph() and get_device_context() now include the Neo4j relationship id in the Cytoscape edge id.
- Strict overlay mode. UI now sends strict_overlays=true so an empty overlay selection returns nodes only (no edges) instead of the legacy "show everything". Devices without a PlatformSite parent are backfilled in nodes-only mode. Non-UI callers retain the old back-compat default.
- Site grouping toggle. New Groups toolbar button shows/hides the compound Site/PlatformSite parents. State persists across page reloads.
- Multi-overlay topology. The single-dimension picker is replaced by toggleable overlays: Physical, L2 (VLAN+STP), L3 (Routing), SD-WAN, Fabric (EVPN), and Virtual, selectable in any combination. Backend accepts ?overlay= (repeatable) and returns the UNION of the selected edge types. The legacy ?dimension= parameter still works. UI overlay state persists in localStorage.
- MAC vendor enrichment. A new correlation pass (_enrich_mac_vendors) annotates every MACAddress with its IEEE vendor via an in-memory OUI table (netcortex.util.oui, mac-vendor-lookup>=0.1.15). Locally administered MACs return an empty string so randomized client MACs don't pollute the table.
- Header version pill is now visible (bordered monospace badge instead of muted gray text).
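The locally-administered check mentioned in the MAC vendor enrichment item relies on the second-least-significant bit of the first octet. The sketch below illustrates that rule with a toy in-memory table; the table contents are placeholders and vendor_for is a hypothetical name, not the netcortex.util.oui API.

```python
from typing import Dict


def is_locally_administered(mac: str) -> bool:
    """True when the U/L bit (0x02) of the first octet is set."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)


def vendor_for(mac: str, oui_table: Dict[str, str]) -> str:
    """Return the vendor for a MAC, or "" for randomized or unknown prefixes."""
    if is_locally_administered(mac):
        return ""
    prefix = mac.upper().replace("-", ":")[:8]  # first three octets
    return oui_table.get(prefix, "")


# Placeholder table; a real OUI table is loaded from the IEEE registry.
toy_oui_table = {"00:11:22": "ExampleCorp (placeholder)"}

known = vendor_for("00:11:22:aa:bb:cc", toy_oui_table)
randomized = vendor_for("da:a1:19:00:00:01", toy_oui_table)
```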
The 0.5.0 release line and the early-0.6.0 dev cycle introduced
NetCortex's MCP transport, the four-phase agentic-ops surface
(status-history correlator → connectivity-strip UI → Links table →
agentic-ops MCP tools), the streamable-HTTP /mcp/ mount, and
21+ agentic-ops MCP tools. Per-release detail lives in
CHANGELOG.md; the design rationale lives in
docs/agentic-ops.md and
docs/mcp-tools.md.
A four-release arc that took top_problems from "technically correct
but operationally unusable" to "ranked, actionable, source-of-truth-
backed". Each release exists because the previous one's fix was
necessary but insufficient — together they form the contract
documented in §19.
- dev17: apply_transition seed branch no longer fakes a <field>_changed_at stamp on first observation. The seed writes history JSON (so the connectivity strip has data) but defers the _changed_at answer to _stamp_freshness, which backfills from first_seen. Before this, every long-standing-down link reported as "just went down at " in a 30-ms cluster on first boot. Includes a one-shot Cypher cleanup snippet for graphs that had already been corrupted.
- dev18: _infer_wan_topology snapshot/restore was missing r.oper_status itself. The correlator deletes and re-MERGEs every correlator-owned WAN_UPLINK every cycle; without snapshotting oper_status, the freshly-recreated edge looked like a transition to the enrichment query, which re-stamped _changed_at every cycle. Fix: snapshot AND restore oper_status alongside the history JSON and flap scalars, using coalesce so partially-populated snapshots are handled cleanly.
- dev19: Cross-verification against the Meraki dashboard revealed that the remaining critical link_down entries were accurate but mostly not actionable: ~17 of 19 reported MX uplinks were on appliances Meraki itself last heard from months ago. Introduces the source-of-truth staleness policy: every device_down and link_down problem consults the device's meraki_last_reported_at and is demoted (or filtered) when stale. Two new config keys (top_problems_stale_after_seconds, top_problems_stale_severity) live in the netcortex/core secret. See §19 for the full contract. Adds netcortex.util.timestamps.iso_to_epoch_ms.
- dev20: A second cross-verification against Meraki + Catalyst Center exposed six data-quality gaps where the graph either undersold what the source-of-truth already had, or lost information between the adapter and the MCP-tool projection. All six fixed in one drop:
  - SDWAN_TUNNEL.oper_status from Meraki reachability: Meraki adapter now maps each peer's reachability (reachable / unreachable) onto canonical oper_status (up / down). This wires SD-WAN tunnels into the existing history correlator AND the top_problems link_down check, so SD-WAN-only outages now surface alongside physical and WAN_UPLINK outages. The dev19 staleness policy applies unchanged via the A-side MX's meraki_last_reported_at. "unknown" peers leave oper_status unset (history correlator filters NULLs).
  - Prefix.kind discriminator: Meraki adapter stamps a small operator-facing taxonomy onto every Prefix: vlan_subnet for vlan / vlan6 / svi / svi6 scopes, static_route for static. Future scopes (transit, wan) slot in without schema changes.
  - Catalyst Center per-switch MAC-address-table fallback: section 5 of CATC discover already creates LEARNED_MAC edges when /v1/host returns connectedNetworkDeviceId + connectedInterfaceName. New section 5b walks /network-device/{deviceId}/mac-address-table per switch as a fallback so port↔MAC binding gets stitched even when the assurance pipeline is empty. Best-effort: schema variations (interfaceNumber / ifName / portName / interface) are handled; per-switch failures degrade to log.debug.
  - WAN_UPLINK per-slot visibility: _infer_wan_topology has always created one WAN_UPLINK edge per slot (wan1 / wan2), distinguished by wan_slot. links_list previously dropped wan_slot from the slim projection; both edges looked identical to an agent. iface_a now folds in r.wan_slot via COALESCE, and the slim view exposes wan_slot, via, and source_adapter as first-class fields.
  - links_list exposes source_adapter: agents can now tell adapter-discovered cables (meraki, catalyst_center, snmp) apart from correlator-built edges (WAN uplinks to Internet, AS boundary peers) without a second graph round-trip.
  - Meraki device-name canonicalisation: dashboard names with trailing/leading/internal whitespace (e.g. "Home MX ") are now trimmed and collapsed at ingest via _norm_device_name. Cross-system joins (NetBox lookups, top_problems grouping, history keys) stop silently missing matches.
Three new pure helpers in netcortex/adapters/meraki.py
(_reachability_to_oper_status, _scope_to_prefix_kind,
_norm_device_name) own these decision boundaries and are
unit-tested in tests/adapters/test_meraki_helpers.py with 24
parametrised cases. The CATC walk uses import asyncio for a
semaphore-bounded concurrent fan-out.
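A semaphore-bounded fan-out of the kind described for the CATC MAC-table walk can be sketched as below. fetch_mac_table is a hypothetical stand-in for the per-switch HTTP call, and the concurrency limit and device ids are illustrative; the real section 5b code may differ.

```python
import asyncio
from typing import Dict, List, Tuple


async def fetch_mac_table(device_id: str) -> List[dict]:
    """Hypothetical stand-in for the per-switch mac-address-table request."""
    await asyncio.sleep(0.01)
    if device_id == "sw-bad":
        raise RuntimeError("simulated per-switch failure")
    return [{"mac": "00:11:22:33:44:55", "ifName": "Gi1/0/1"}]


async def walk_mac_tables(
    device_ids: List[str], max_concurrency: int = 4
) -> Dict[str, List[dict]]:
    """Walk every switch, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def walk_one(device_id: str) -> Tuple[str, List[dict]]:
        async with semaphore:
            try:
                return device_id, await fetch_mac_table(device_id)
            except Exception as exc:  # best-effort: one bad switch must not abort the pass
                print(f"debug: mac-table walk failed for {device_id}: {exc}")
                return device_id, []

    pairs = await asyncio.gather(*(walk_one(d) for d in device_ids))
    return dict(pairs)


tables = asyncio.run(walk_mac_tables(["sw-1", "sw-bad", "sw-2"], max_concurrency=2))
```

Catching the exception inside the per-switch task is what keeps a single failing switch from cancelling the whole gather.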
This section captures the contracts that the dev17–dev20 arc made load-bearing. A future AI rebuilding the system from scratch should implement these invariants from day one, not retrofit them under operator pressure.
top_problems is the hero MCP tool. An agent calls it first, takes
the rank at face value, and drills in from there. If the ranking is
wrong — either because timestamps are fake (dev17 / dev18) or because
critical-severity rows are actually stale inventory the dashboard
itself has given up on (dev19) — the agent gets misled, the operator
loses trust, and the whole agentic-ops surface collapses to a manual
Cypher session.
Three independent failure modes existed in 0.6.0-dev16:
- Manufactured transitions. Status-history scalars (
_changed_at,_history) were stamped on every cycle even when nothing changed, so the rank-by-recency order was meaningless. - No source-of-truth staleness signal. A WAN_UPLINK on an MX the
dashboard hadn't heard from in 90 days reported with the same
criticalseverity as one Meraki polled five minutes ago. - Schema drops between adapter and MCP projection. Information
the adapter had (Meraki
reachability, wan slot, source adapter, CATC switch MAC tables, Meraki prefix scope) was either not promoted onto the graph or was dropped by the slim view, leavingtop_problemsunable to surface SDWAN outages, per-WAN-slot visibility, or port↔MAC binding.
dev17, dev18, dev19, dev20 — each release fixed exactly one of these modes, and the contracts below are the result.
Every tracked operational field on every tracked element follows the
same schema: the field itself plus six companion properties. The math
lives in netcortex/graph/history.py (unit-tested in
tests/graph/test_history.py); the per-cycle application happens in
_update_status_history in netcortex/graph/correlate.py.
<field> — current value, e.g. "up"
<field>_changed_at — epoch_ms of the last *real* transition
<field>_history — JSON: [[at_ms, new_state], ...] (≤200 events, 7-day window)
<field>_flap_count_1h
<field>_flap_count_24h
<field>_flap_score_1h — count_1h / 6.0, saturated at 1.0
<field>_flap_state — "stable" | "unstable" | "flapping"
Classification:
- flapping = ≥5 transitions in the last hour
- unstable = ≥5 transitions in the last 24h but not the last hour
- stable = neither
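The score and classification rules above reduce to a few lines. The function names below are illustrative and are not taken from history.py; the thresholds and the divisor are the ones stated in this section.

```python
def flap_score_1h(count_1h: int) -> float:
    """count_1h / 6.0, saturated at 1.0."""
    return min(count_1h / 6.0, 1.0)


def flap_state(count_1h: int, count_24h: int) -> str:
    """Classify using the thresholds listed above."""
    if count_1h >= 5:
        return "flapping"
    if count_24h >= 5:
        return "unstable"
    return "stable"


examples = [
    (flap_score_1h(c1), flap_state(c1, c24))
    for c1, c24 in [(0, 2), (1, 7), (5, 5), (12, 40)]
]
```

Note that five transitions in the last hour already classify as flapping while the score only saturates at six, so flap_state and flap_score_1h answer slightly different questions.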
Tracked fields today:
| Element | Field | Source |
|---|---|---|
| Device | status | Adapter (Meraki, CATC, …) |
| PHYSICAL_LINK | oper_status | Correlator (_enrich_*_health) |
| WAN_UPLINK | oper_status | Correlator (_enrich_wan_uplinks_with_health) |
| SDWAN_TUNNEL | oper_status | Adapter via _reachability_to_oper_status (dev20) |
| ROUTING_PEER | oper_status | Adapter / SNMP |
Three invariants enforced across all tracked fields:
| Invariant | Where enforced | Why |
|---|---|---|
| _changed_at only on real transitions | apply_transition in history.py: seed branch writes history but NOT _changed_at | A seed event is "we just started tracking", not "the network just changed" |
| _changed_at backfilled from first_seen on edges without one | _stamp_freshness in correlate.py | The UI needs something to draw; "first time we saw this edge in its current state" is the honest answer |
| Destructive correlator rebuilds preserve state across the cycle | _infer_wan_topology snapshot/restore captures history JSON, flap scalars, _changed_at, first_seen AND oper_status itself | Without oper_status in the snapshot, the next enrichment query sees prev_oper IS NULL and fakes a transition every cycle (dev18 root cause) |
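The first invariant is easiest to see in code. The sketch below is an illustration under assumptions: the signature and return shape are hypothetical and do not mirror the real apply_transition in history.py. It keeps the documented limits (at most 200 events, 7-day window) and shows the seed branch writing history without stamping a change time.

```python
import json
from typing import Optional

MAX_EVENTS = 200
WINDOW_MS = 7 * 24 * 3600 * 1000


def apply_transition(
    prev_value: Optional[str],
    prev_history_json: Optional[str],
    prev_changed_at: Optional[int],
    new_value: str,
    now_ms: int,
) -> dict:
    """Return the updated value, change timestamp, and history JSON."""
    history = json.loads(prev_history_json) if prev_history_json else []
    changed_at = prev_changed_at
    if prev_value is None:
        # Seed: start tracking, but do not claim the network changed now.
        history.append([now_ms, new_value])
    elif new_value != prev_value:
        # Real transition: record it and stamp the change time.
        history.append([now_ms, new_value])
        changed_at = now_ms
    history = [e for e in history if e[0] >= now_ms - WINDOW_MS][-MAX_EVENTS:]
    return {"value": new_value, "changed_at": changed_at, "history": json.dumps(history)}


seed = apply_transition(None, None, None, "down", now_ms=1_000)
same = apply_transition("down", seed["history"], seed["changed_at"], "down", now_ms=2_000)
flip = apply_transition("down", same["history"], same["changed_at"], "up", now_ms=3_000)
```

In this sketch the seed result leaves changed_at as None, which is the gap _stamp_freshness is described as backfilling from first_seen.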
top_problems device_down and link_down rows consult the A-side
device's meraki_last_reported_at. The policy is configurable via
two netcortex/core secret keys with defaults shown:
top_problems_stale_after_seconds: 86400 # 24 h
top_problems_stale_severity: info # "critical"|"warning"|"info"|"filter"
The decision matrix:
| Meraki lastReportedAt | Resulting severity |
|---|---|
| within the threshold | unchanged (critical) |
| older than threshold, severity≠filter | demoted to top_problems_stale_severity |
| older than threshold, severity=filter | omitted from the response |
| missing (non-Meraki, never reported) | unchanged — fail open so other adapters aren't silenced |
Every demoted row carries a stale: true flag and a
stale_seconds: N evidence field, so an agent that wants to widen
its query can still see the inventory.
top_problems_stale_severity is validated in Settings.hydrate — an
unknown value logs a warning and falls back to the in-memory default.
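The decision matrix can be expressed as one small function. This is a sketch under assumptions: the problem rows are plain dicts, meraki_last_reported_at is taken to be already converted to epoch milliseconds, and the function name is illustrative rather than the real _apply_staleness_policy signature.

```python
from typing import Optional


def apply_staleness_policy(
    problem: dict,
    now_ms: int,
    stale_after_seconds: int = 86400,
    stale_severity: str = "info",
) -> Optional[dict]:
    """Return the row unchanged, demoted, or None when it should be omitted."""
    last_reported_ms = problem.get("meraki_last_reported_at")
    if last_reported_ms is None:
        return problem  # fail open: non-Meraki or never reported
    age_seconds = (now_ms - last_reported_ms) // 1000
    if age_seconds <= stale_after_seconds:
        return problem  # fresh: keep the original severity
    if stale_severity == "filter":
        return None
    return {
        **problem,
        "severity": stale_severity,
        "stale": True,
        "stale_seconds": age_seconds,
    }


NOW = 1_700_000_000_000
fresh = {"kind": "link_down", "severity": "critical", "meraki_last_reported_at": NOW - 300_000}
dormant = {"kind": "link_down", "severity": "critical", "meraki_last_reported_at": NOW - 90 * 86_400_000}
other = {"kind": "device_down", "severity": "critical"}

results = [apply_staleness_policy(p, NOW) for p in (fresh, dormant, other)]
filtered = apply_staleness_policy(dormant, NOW, stale_severity="filter")
```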
Pure helpers in netcortex/adapters/meraki.py own the decision
boundary between platform-native values and canonical graph values.
The "pure" constraint matters: each helper is a single-expression
function with no I/O, registered with parametrised unit tests in
tests/adapters/test_meraki_helpers.py. A future AI extending this
should follow the same pattern — never embed the mapping inline in
discover().
| Helper | Input | Output | Notes |
|---|---|---|---|
| _norm_device_name | dashboard name | trimmed + internal whitespace collapsed | Apply at ingest; cross-system joins (NetBox, history keys) depend on the canonical form |
| _reachability_to_oper_status | Meraki reachability | up / down / None | None for unknown/missing; the history correlator's WHERE oper_status IS NOT NULL filter then keeps fake "unknown" transitions out of the timeline |
| _scope_to_prefix_kind | Meraki prefix scope | vlan_subnet / static_route / None | Extensible: future scopes (transit, wan) slot in without changing call sites |
The slim view used by links_list (netcortex/mcp/tools/agentic_ops.py)
is the authoritative agent-facing surface for transit edges. Any
field that an agent might filter on, or might use to disambiguate
two otherwise-identical edges, MUST appear in the slim projection —
even if it's empty for some edge types. As of dev20 the slim view
is the union of:
- the universal status-history fields (oper_status, oper_status_flap_state, oper_status_flap_score_1h, oper_status_changed_at, oper_status_history),
- the type-specific operational fields listed in §5,
- and three provenance/disambiguator fields:
  - source_adapter: meraki/*, catalyst_center/*, snmp/*, or empty for correlator-built edges.
  - wan_slot: wan1 / wan2 for dual-WAN MX uplinks; empty otherwise.
  - via: mx_uplink / ebgp for correlator-built WAN_UPLINK edges; empty otherwise.
get_links in netcortex/graph/query.py also COALESCEs
r.wan_slot into the canonical iface_a field so dual-WAN edges
read as wan1 / wan2 in the same column that physical-link edges
use for their port names. This makes the same query work for all
transit edge types.
| Version | Fix | Lives in |
|---|---|---|
| 0.6.0-dev17 | _changed_at no longer stamped on seed | history.apply_transition, correlate._stamp_freshness |
| 0.6.0-dev18 | oper_status preserved across WAN rebuilds | correlate._infer_wan_topology snapshot/restore |
| 0.6.0-dev19 | Staleness policy demotes dormant inventory | mcp.tools.agentic_ops._apply_staleness_policy, Settings.top_problems_stale_* |
| 0.6.0-dev20 (Fix #1) | SDWAN reachability → oper_status | meraki._reachability_to_oper_status |
| 0.6.0-dev20 (Fix #2) | WAN_UPLINK per-slot visibility | query.get_links + slim projection in agentic_ops.links_list |
| 0.6.0-dev20 (Fix #3) | links_list exposes source_adapter | slim projection in agentic_ops.links_list |
| 0.6.0-dev20 (Fix #4) | CATC MAC-table fallback | catalyst_center.discover section 5b |
| 0.6.0-dev20 (Fix #5) | Prefix.kind taxonomy | meraki._scope_to_prefix_kind + list_prefixes |
| 0.6.0-dev20 (Fix #6) | Device-name canonicalisation | meraki._norm_device_name |
The CHANGELOG entries for dev17–dev20 carry the full prose rationale; this index is the cheat-sheet for "which file owns this invariant?".
- Create netcortex/adapters/myplatform.py implementing PlatformAdapter.
- Implement authenticate(), health_check(), and discover() (must return GraphData).
- Register in pyproject.toml under [project.entry-points."netcortex.adapters"]: myplatform = "netcortex.adapters.myplatform:MyPlatformAdapter"
- Add an instance to netcortex/adapters/_index in the secret backend.
- Create the config secret at netcortex/adapters/myplatform/{instance_name}.
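The steps above can be sketched as a skeleton adapter. Everything here besides the three method names and the class names is an assumption: PlatformAdapter and GraphData are minimal stand-ins defined locally so the example runs on its own, the methods are assumed to be coroutines, and the api_key config field is a placeholder. Check the real base class before copying the shape.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GraphData:
    """Stand-in for the real GraphData container (fields are assumed)."""
    nodes: List[dict] = field(default_factory=list)
    edges: List[dict] = field(default_factory=list)


class PlatformAdapter(ABC):
    """Stand-in for the real base class; only the method names are from the journal."""

    @abstractmethod
    async def authenticate(self) -> None: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def discover(self) -> GraphData: ...


class MyPlatformAdapter(PlatformAdapter):
    def __init__(self, config: dict) -> None:
        self._config = config  # loaded from the per-instance secret
        self._authenticated = False

    async def authenticate(self) -> None:
        if not self._config.get("api_key"):
            raise ValueError("api_key missing from adapter secret")
        self._authenticated = True

    async def health_check(self) -> bool:
        return self._authenticated

    async def discover(self) -> GraphData:
        # A real adapter would call the platform API here.
        return GraphData(nodes=[{"label": "Device", "name": "example-device"}])


async def demo() -> Tuple[bool, GraphData]:
    adapter = MyPlatformAdapter({"api_key": "placeholder"})
    await adapter.authenticate()
    return await adapter.health_check(), await adapter.discover()


healthy, data = asyncio.run(demo())
```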
# Connect to Neo4j
docker exec -it netcortex-neo4j cypher-shell -u neo4j -p netcortex
# Example queries
MATCH (d:Device) WHERE d.snmp_polled = true RETURN d.name, d.mgmt_ip;
MATCH (d:Device)-[:STP_ROOT]->(dom:STPDomain) RETURN d.name, dom.root_bridge_mac;
MATCH (a:Device)-[r:ROUTING_PEER]->(b) RETURN a.name, r.protocol, b.name LIMIT 20;
MATCH (d:Device)-[:ROUTES_TO]->(p:Prefix) RETURN d.name, p.prefix ORDER BY p.prefix;
MATCH (d:Device) WHERE d.stub = true RETURN count(d);
| Decision | Rationale |
|---|---|
| Neo4j as the graph store | Native graph queries, Cypher language, Cytoscape.js integration; pluggable via GraphBackend interface |
| No separate database | NetBox is the SoT for intended state; Neo4j is for observed/operational state only |
| Secrets never in code or NetBox | External secret backend (AWS SM / Vault) is the only place credentials live |
| Native worker on macOS | Docker network isolation blocks SNMP to private management IPs; native process has full routing table access |
| stub flag on unverified nodes | LLDP/CDP/OSPF discovery creates neighbor references that may or may not be real devices; stub flag prevents inventory pollution while keeping topological edges |
| Set-based deduplication in SNMP | O(N²) list scans caused minute-long hangs when processing thousands of LLDP/routing entries; O(1) set lookups fixed this |
| Per-walk SNMP timeouts | A single unresponsive device's large MIB table could block the asyncio event loop for the entire cycle; asyncio.wait_for wraps every walk |
| Dimension-based graph filtering | A single graph contains all topology layers; the UI filters to one dimension via edge type allow-lists rather than maintaining separate graphs |
| Pure helper functions own canonical mappings (dev20) | Decision boundaries between platform values and graph values must be unit-testable in isolation; embedding them inline in discover() makes regressions invisible |
| Source-of-truth staleness > generic timeout (dev19) | The dashboard already knows when it last heard from a device; consulting that signal (rather than wall-clock time) means dormant inventory stops dominating top_problems without dropping genuinely fresh-but-still-down problems |
| Status-history _changed_at only on real transitions (dev17/18) | A correlator-side seed event is not a network event; faking the timestamp on first observation poisons every "rank by recency" query downstream |
The dev19 and dev20 fixes both started with a cross-verification session against the source-of-truth platforms (Meraki dashboard, Catalyst Center). This appendix captures the repeatable playbook so the next agent doesn't have to rediscover it.
- Before bumping a major or minor version.
- When top_problems starts returning results that "feel" wrong (too many criticals, suspicious clustering of timestamps, missing outages an operator just saw).
- After adding a new adapter or a new correlator pass that touches transit edges.
# Pseudocode — replace with the actual MCP tool calls / adapter APIs.
nc_inventory = mcp.netcortex.inventory_list(limit=500)
nc_links = mcp.netcortex.links_list(limit=500)
nc_problems = mcp.netcortex.top_problems(limit=200)
meraki_devices = meraki.getOrganizationDevices(org_id)
meraki_uplinks = meraki.getOrganizationApplianceUplinkStatuses(org_id)
meraki_vpn = meraki.getOrganizationApplianceVpnStatuses(org_id)
catc_hosts = catc.get_host_table()
catc_macs = [catc.get_device_mac_table(dev_id) for dev_id in switch_ids]
Always pull paginated results to exhaustion. Partial pulls have fooled past verification runs into reporting fake "missing" inventory.
The two sides use different canonical keys:
| Concept | Meraki | NetCortex |
|---|---|---|
| Device | serial | Device.serial (preferred) or Device.name |
| MX uplink | (serial, interface) (wan1/wan2) | WAN_UPLINK(Device → Internet, wan_slot=…) |
| AutoVPN tunnel | (network_id, peer_network_id) | SDWAN_TUNNEL(Device → Device) |
| Prefix | (cidr, scope) | Prefix(cidr, scope, kind) |
Trim whitespace, lower-case where appropriate, and use the same canonical form on both sides before diffing.
| Pattern | What it usually means |
|---|---|
| In Meraki, not in NetCortex | Missing adapter call, pagination cap hit, or correlator dropped the entity |
| In NetCortex, not in Meraki | Stale inventory the housekeeping loop hasn't garbage-collected, OR Meraki removed it without us noticing |
| In both but property mismatch | Adapter parsing bug, correlator overwriting adapter value, or MCP slim view dropping the property |
The third pattern is the most insidious — it's the one that produced all six dev20 fixes.
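The three patterns fall out of a keyed comparison once both sides share a canonical key. The sketch below assumes each side has already been reduced to a dict keyed by serial with a small property dict as the value; the serials and property names are placeholders, and canon and diff_sides are illustrative helper names.

```python
from typing import Dict


def canon(key: str) -> str:
    """Canonical form used on both sides: trimmed, collapsed, lower-cased."""
    return " ".join(key.split()).lower()


def diff_sides(meraki: Dict[str, dict], netcortex: Dict[str, dict]) -> dict:
    """Bucket discrepancies into the three patterns from the table above."""
    m = {canon(k): v for k, v in meraki.items()}
    n = {canon(k): v for k, v in netcortex.items()}
    mismatches = {}
    for key in sorted(m.keys() & n.keys()):
        fields = sorted(m[key].keys() | n[key].keys())
        delta = {f: (m[key].get(f), n[key].get(f)) for f in fields if m[key].get(f) != n[key].get(f)}
        if delta:
            mismatches[key] = delta
    return {
        "in_meraki_only": sorted(m.keys() - n.keys()),
        "in_netcortex_only": sorted(n.keys() - m.keys()),
        "property_mismatch": mismatches,
    }


meraki_side = {
    "SERIAL-0001": {"name": "Home MX", "wan1": "up"},
    "SERIAL-0002": {"name": "Branch MX", "wan1": "down"},
}
netcortex_side = {
    "serial-0001 ": {"name": "Home MX", "wan1": "down"},
    "SERIAL-0003": {"name": "Old MX", "wan1": "down"},
}

report = diff_sides(meraki_side, netcortex_side)
```

Each value in property_mismatch is a (Meraki value, NetCortex value) pair, which maps directly onto the discrepancy columns used when logging findings.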
Use one row per discrepancy with these columns:
- Pattern (one of the three above)
- Entity (canonical id)
- NetCortex value (what we expose)
- Meraki value (what the dashboard shows)
- Suspected location (file:function)
- Severity (does it affect top_problems ranking? agent decisions? UI accuracy?)
- Proposed fix (one of: adapter normalisation, correlator, MCP projection, schema, policy)
Each dev release in the chain should solve exactly one class of problem, ship with unit tests for any new helper, bump the version, and update CHANGELOG + this journal in the same commit. Don't batch unrelated fixes — the chain of evidence in the dev17 → dev18 → dev19 → dev20 arc only worked because each release could be verified independently.
Use the same scripts (with the version bumped in any version assertions). If a new discrepancy appears that wasn't visible before, you've likely uncovered a second-order effect — log it and plan a follow-up release. If the targeted discrepancy disappeared and nothing else broke, ship.
A self-contained Python script for running the targeted checks directly against Neo4j (bypasses MCP, useful when the MCP layer itself is under suspicion):
# /tmp/nc_verify.py — run inside the netcortex container:
# docker compose exec netcortex python /tmp/nc_verify.py
import asyncio, os

from netcortex.config import init_settings, get_settings
from netcortex.graph.client import init_client, run_query


async def main() -> None:
    await init_settings()
    s = get_settings()
    await init_client(s.neo4j_uri, s.neo4j_user, s.neo4j_password)

    print("=== SDWAN_TUNNEL oper_status distribution ===")
    rows = await run_query(
        "MATCH ()-[r:SDWAN_TUNNEL]->() "
        "RETURN coalesce(r.oper_status, 'unset') AS s, count(r) AS n "
        "ORDER BY n DESC"
    )
    for r in rows:
        print(f" {r['s']:>10} {r['n']}")

    print("=== Prefix.kind distribution ===")
    rows = await run_query(
        "MATCH (p:Prefix) "
        "RETURN coalesce(p.kind, 'unset') AS k, count(p) AS n "
        "ORDER BY n DESC"
    )
    for r in rows:
        print(f" {r['k']:>14} {r['n']}")

    print("=== Devices with trailing whitespace in name (should be 0) ===")
    rows = await run_query(
        "MATCH (d:Device) WHERE d.name <> trim(d.name) "
        "RETURN d.name AS name, d.serial AS serial LIMIT 50"
    )
    for r in rows:
        print(f" {r['serial']:>14} {r['name']!r}")

    print("=== WAN_UPLINK per-slot counts ===")
    rows = await run_query(
        "MATCH ()-[r:WAN_UPLINK]->() "
        "RETURN coalesce(r.wan_slot, '∅') AS slot, "
        " coalesce(r.via, '∅') AS via, count(r) AS n "
        "ORDER BY slot, via"
    )
    for r in rows:
        print(f" slot={r['slot']:>4} via={r['via']:>10} {r['n']}")


asyncio.run(main())
Keep verification scripts in /tmp/ (not the repo). They exist
to capture a moment in time, not to become long-lived test
fixtures. Anything worth keeping graduates into
tests/integration/ with proper Pytest scaffolding.
This section captures the current operator-facing behavior as of
0.6.0-dev23, including sync controls, Meraki polling defaults, and
MX state semantics.
- Added per-instance sync endpoint: POST /api/adapters/{adapter_type}/{instance_name}/sync
- Existing global sync endpoint remains: POST /api/adapters/sync
- Adapter table now includes per-row Sync now.
- While active, the same button flips to Syncing… and shows spinner.
- Running state is backend-driven by AdapterStatus.sync_running (exposed in /api/status) and reconciled by a per-adapter UI watcher so the button clears quickly when the adapter finishes.
- Default Meraki sync interval is now 60 minutes (3600s):
  - Settings.sync_interval_meraki = 3600
  - docs/examples updated in README.md, docs/sync-engine.md, and docs/secrets.md.
- Explicit secret values still override built-in defaults.
Historically, many Meraki MX nodes appeared status=active even when
both WAN circuits were down, because device inventory status and
per-uplink state came from different signals.
Current behavior:
- WAN_UPLINK.oper_status continues to use per-uplink Meraki states (mx_wan1_status / mx_wan2_status) when available.
- Correlation now rolls uplink truth up to Device.oper_state for MXs:
  - both WANs down/disabled -> down
  - stale meraki_last_reported_at (>24h) -> alerting
  - any WAN up -> up
  - other partial/unknown WAN state -> alerting
- Device status history and API projections now prefer oper_state before static status, so UI and MCP consumers observe operational state rather than inventory-only state for MX devices.
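The rollup rules can be read as an ordered decision list. The sketch below assumes the rules are evaluated in the order they are listed, that per-uplink states are already normalised to strings such as up, down, and disabled, and that staleness is supplied as an age in seconds; the function name is illustrative and the correlator's actual evaluation order may differ.

```python
from typing import Optional

DOWN_LIKE = {"down", "disabled"}


def mx_oper_state(
    wan1: Optional[str],
    wan2: Optional[str],
    last_reported_age_s: Optional[int],
    stale_after_s: int = 86400,
) -> str:
    """Roll per-uplink state up to a single Device.oper_state value."""
    states = [wan1, wan2]
    if all(s in DOWN_LIKE for s in states):
        return "down"
    if last_reported_age_s is not None and last_reported_age_s > stale_after_s:
        return "alerting"
    if any(s == "up" for s in states):
        return "up"
    return "alerting"  # partial or unknown WAN state


cases = {
    "both_down": mx_oper_state("down", "disabled", 120),
    "one_up": mx_oper_state("up", "down", 120),
    "stale": mx_oper_state("up", "up", 3 * 86400),
    "unknown": mx_oper_state("down", None, 120),
}
```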