Optimize boot time and consolidate confd into a single daemon #1412

Merged
troglobit merged 46 commits into main from initviz
Mar 20, 2026

Conversation

@troglobit troglobit commented Feb 22, 2026

Description

This PR replaces sysrepo-plugind + bootstrap + load scripts with a single confd that handles config generation, datastore init, config loading, and plugin management. The daemon uses a libev event loop with SR_SUBSCR_NO_THREAD instead of ~30 per-subscription sysrepo threads.

The new initviz package was used for boot time visualization.

Initial work on this branch sped up the sysctl-sync-ip-conf and mnt scripts and moved hostname.d setup from runtime to build time. To optimize the bootstrap process and free up CPU time on single-core systems, the start of many services, such as dbus, dnsmasq, and statd, has been postponed to either runlevel 2 or until after confd bootstrap. To reduce overhead, we also drop unnecessary logger processes for services that already log to syslog. Even BusyBox has been inspected: we now enable NOEXEC/NOFORK applets.

Another notable change in this PR drops WiFi and GPS from the minimal defconfigs and moves the WiFi/firmware selects from the RPi board Config.in to the full defconfigs, so features are explicit rather than implicit via the BSP.

Additional changes on this branch:

  • User-configurable HTTPS certificate: A new certificate node under services web lets users select any keystore certificate for the HTTPS management
    interface. The default auto-generated self-signed cert is now stored in the keystore (migration script included for existing devices).
  • confd crash recovery: A sentinel file on tmpfs lets confd distinguish a crash-restart from a fresh boot and skip destructive bootstrap phases (sysrepo SHM wipe, factory config reinstall) on restart, preserving a consistent datastore state across unexpected restarts.
  • NTP stratum-weight fix: The stratum-weight YANG leaf and its description had inverted semantics — 0.0 disables stratum weighting entirely (pure distance-based selection), not the other way around. The client_stratum_selection test now uses 1.0, which gives each stratum level a 1-second effective advantage and guarantees deterministic stratum-based selection on local networks. Fixes the long-standing flaky test (ntp: client_stratum_selection: Fails to select lowest stratum sometimes #1361).
  • Log noise reduction: Silenced spurious warnings from rousette (Worrying rousette warnings in log #892) and netopeer2-server (Worrying netopeer2-server warnings in log #1446). Transient mDNS daemon restart messages in statd are now suppressed unless the mDNS service is actually enabled in config. The "avahi" subsystem is renamed to "mdns" in all user-facing log messages.
  • CI: Fixed Coverity Scan build broken by the new avahi/mDNS dependency in statd (libavahi-client-dev was missing from the apt-get step).

Checklist

Tick relevant boxes, this PR is-a or has-a:

  • Bugfix
    • Regression tests
    • ChangeLog updates (for next release)
  • Feature
    • YANG model change => revision updated?
    • Regression tests added?
    • ChangeLog updates (for next release)
    • Documentation added?
  • Test changes
    • Checked in changed Readme.adoc (make test-spec)
    • Added new test to group Readme.adoc and yaml file
  • Code style update (formatting, renaming)
  • Refactoring (please detail in commit messages)
  • Build related changes
  • Documentation content changes
    • ChangeLog updated (for major changes)
  • Other (please describe):

@troglobit troglobit added the ci:main Build default defconfig, not minimal label Feb 22, 2026
@troglobit troglobit marked this pull request as ready for review February 22, 2026 11:24

@troglobit troglobit force-pushed the initviz branch 2 times, most recently from b8ae1be to 98772d7 Compare March 2, 2026 10:18
@troglobit troglobit force-pushed the initviz branch 2 times, most recently from fbe6691 to daab75e Compare March 17, 2026 13:49
@troglobit troglobit marked this pull request as draft March 17, 2026 13:49
@troglobit troglobit force-pushed the initviz branch 4 times, most recently from a529683 to b920f7f Compare March 20, 2026 05:58
@troglobit troglobit marked this pull request as ready for review March 20, 2026 06:14

troglobit commented Mar 20, 2026

@mattiaswal + @wkz this PR is now finally done and ready for review. Tests 100% PASS on all test systems, and I'm confident the flaky NTP stratum test is fixed for realz this time!

@mattiaswal mattiaswal left a comment

Massive work. This work is insane 🦸

with test.step("Verify both GPS receivers have a fix"):
    until(lambda: gps.has_fix(target, "gps0"), attempts=60)
    until(lambda: gps.has_fix(target, "gps1"), attempts=60)
    until(lambda: gps.has_position(target, "gps0"), attempts=60)

@mattiaswal mattiaswal Mar 20, 2026

This is wrong. First check that it has a fix on the satellites, then add a new test step to verify that the position exists.

Contributor Author

Did you see the commit message? The idea was not to verify the position but to fix a runtime error KeyError: 'altitude' I got in one test run.

Contributor

> Did you see the commit message? The idea was not to verify the position but to fix a runtime error KeyError: 'altitude' I got in one test run.

Yes, I just want us to test the fix part too (which you removed). Add gps.has_position just before verify_position instead.

Contributor Author

OK.

Contributor Author

Fixed.

 - The biggest changes are syncing with latest BusyBox (busybox-update-config)
 - Disable optimize for size
 - Enable feature "SH_NOFORK" which allows /bin/sh to call applet_main()
   directly without having to fork+exec busybox

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
The v2025.01 release supports the Microchip SamA7G5* eval kit(s), which
means we can enjoy the same patch level of U-Boot as other Infix boards

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
For details, see:
 - linux4sam/u-boot-at91@23ac019
 - linux4sam/linux-at91@5b35500

U-Boot patches imported and refreshed in local KernelKit fork of U-Boot,
see https://github.com/kernelkit/u-boot/tree/v2025.01-kkit

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Backport fixes from upstream post v4.16 release.  Mainly to fix
mdns-alias crash+restart counter issue when avahi-daemon has to
be restarted.  Finit did not properly clear the dependency that
mdns-alias had on avahi-daemon, causing it to crash and have its
restart counter incremented.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Use SR_SUBSCR_NO_THREAD for all subscriptions and integrate sysrepo
event pipes into a libev event loop.  This eliminates approximately 30
per-subscription threads, reducing overhead on embedded ARM hardware.

A temporary poll-based "event pump" thread handles callback dispatch
during bootstrap (where sr_replace_config blocks waiting for callbacks),
then exits.  After bootstrap, the single-threaded libev loop takes over
for steady-state event processing.

Note, the confd-test-mode plugin still requires use of threads so we do
not create deadlocks when calling sr_replace_config().

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
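
To illustrate the threading change described in the commit above, here is a minimal Python sketch of one loop multiplexing several event pipes instead of running one thread per subscription. It is not the confd code: plain `os.pipe()` descriptors stand in for sysrepo's event pipes, and `run_event_loop`, `demo`, and the subscription names are hypothetical.

```python
import os
import selectors


def run_event_loop(pipes, handler):
    """Dispatch data from many read-ends in one thread until all are closed.

    pipes maps a (hypothetical) subscription name to a readable fd.
    """
    sel = selectors.DefaultSelector()
    for name, fd in pipes.items():
        sel.register(fd, selectors.EVENT_READ, data=name)

    open_fds = len(pipes)
    while open_fds:
        for key, _ in sel.select():
            chunk = os.read(key.fd, 4096)
            if chunk:
                handler(key.data, chunk)   # callback runs in this thread
            else:
                # Empty read: the writer closed its end, stop watching it.
                sel.unregister(key.fd)
                os.close(key.fd)
                open_fds -= 1
    sel.close()


def demo():
    """Feed three fake subscriptions one event each and collect them."""
    events = []
    pipes = {}
    for name in ("interfaces", "system", "routing"):
        rfd, wfd = os.pipe()
        os.write(wfd, name.encode())
        os.close(wfd)
        pipes[name] = rfd
    run_event_loop(pipes, lambda name, data: events.append((name, data)))
    return sorted(events)
```

Since every callback runs on the loop's thread, handlers need no locking between them, but a handler that blocks stalls all other subscriptions. That is presumably why the commit keeps a temporary pump thread while bootstrap blocks.
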
Only install the keys on CHANGE event, fixes this annoying issue:

Nov 5 01:32:10 ix confd[2011]: Installing HTTPS gencert certificate "self-signed"
Nov 5 01:32:10 ix confd[2011]: Installing SSH host key "genkey".
Nov 5 01:32:11 ix confd[2011]: Installing HTTPS gencert certificate "self-signed
Nov 5 01:32:11 ix confd[2011]: Installing SSH host key "genkey".

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Replace logging + logging.handlers with a lightweight syslog wrapper,
and argparse with manual argv parsing.  On a sama7g54, this cuts yanger
startup from ~770ms to ~470ms by eliminating ~300ms of stdlib imports.

Also batch external command invocations:

 - ietf_routing: two sysctl calls instead of two per interface
 - ietf_hardware: one ls per hwmon device instead of six
 - bridge: fetch mctl querier data once instead of once per VLAN

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
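
A rough sketch of the two startup techniques named in the commit above: importing a heavy module only when it is first used, and parsing argv by hand instead of with argparse. The option names (`-t`, `-o`) and the function names are hypothetical and are not taken from yanger.

```python
def log_error(msg):
    """Send msg to syslog, importing the module only on first use."""
    import syslog  # deferred so runs that never log skip the import
    syslog.syslog(syslog.LOG_ERR, msg)


def parse_args(argv):
    """Parse a hypothetical command line: [-t] [-o FILE] MODEL."""
    opts = {"test": False, "output": None, "model": None}
    args = iter(argv)
    for arg in args:
        if arg in ("-t", "--test"):
            opts["test"] = True
        elif arg in ("-o", "--output"):
            opts["output"] = next(args, None)
            if opts["output"] is None:
                raise SystemExit(f"{arg} requires an argument")
        elif arg.startswith("-"):
            raise SystemExit(f"unknown option {arg}")
        elif opts["model"] is None:
            opts["model"] = arg
        else:
            raise SystemExit(f"unexpected argument {arg}")
    if opts["model"] is None:
        raise SystemExit("missing model name")
    return opts
```

Hand-rolled parsing gives up argparse's generated help and validation, so it mostly pays off for short-lived tools with a small, stable option set.
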
 - Use same log framework as the rest of confd
 - Use existing primitives from libite + libsrx
 - Drop remaining pthreads
 - Coding style fixes

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
When the mdns service is stop-started (e.g. after a config change),
statd's avahi client fires AVAHI_CLIENT_FAILURE momentarily.  With
AVAHI_CLIENT_NO_FAIL the client reconnects automatically, but the
immediate ERROR log is misleading:

  ERROR: avahi: client failure: Daemon connection failed

New behavior:
- On AVAHI_CLIENT_FAILURE: start a 2 s deferred timer (no immediate log)
- Timer fires up to 3 times (~6 s total); on the 3rd attempt, check if
  mDNS is enabled in the running config via a temporary sysrepo session
- Log ERROR only if the daemon is still down AND mDNS is enabled
- On AVAHI_CLIENT_S_RUNNING: cancel the timer, reset the counter, and
  log NOTE "mDNS daemon reconnected" if a failure had been seen

This silences the error entirely when the operator has disabled mDNS
(expected), and defers it by ~6 s for a brief restart (self-heals
before the timer fires).

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
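
The deferred-error behaviour listed in the commit above can be sketched as a small state machine. This is an illustration in Python, not the statd C code: the class, its method names, and the ERROR message text are hypothetical, and the 2 s timer is assumed to be driven by the caller, who invokes `on_timer()` each time it fires.

```python
class MdnsFailureTracker:
    RETRY_DELAY = 2.0    # seconds between timer callbacks (from the commit)
    MAX_ATTEMPTS = 3     # about 6 s in total

    def __init__(self, mdns_enabled, log):
        self.mdns_enabled = mdns_enabled   # callable: is mDNS on in config?
        self.log = log                     # callable: log(level, message)
        self.attempts = 0
        self.failure_seen = False
        self.timer_armed = False

    def on_failure(self):
        """Client reported failure: arm the timer, log nothing yet."""
        self.failure_seen = True
        if not self.timer_armed:
            self.attempts = 0
            self.timer_armed = True

    def on_timer(self):
        """Timer fired: only the last attempt may produce an ERROR."""
        if not self.timer_armed:
            return
        self.attempts += 1
        if self.attempts < self.MAX_ATTEMPTS:
            return                         # stays armed for the next tick
        self.timer_armed = False
        if self.mdns_enabled():
            self.log("ERROR", "mdns: daemon connection failed")

    def on_running(self):
        """Client reconnected: cancel the timer and reset the counter."""
        self.timer_armed = False
        self.attempts = 0
        if self.failure_seen:
            self.failure_seen = False
            self.log("NOTE", "mDNS daemon reconnected")
```

A brief restart reaches `on_running()` before the third tick, so only the NOTE is logged. With mDNS disabled, the third tick finds `mdns_enabled()` false and stays silent.
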
The operator sees Infix through YANG models and should not need to know
which library implements a given feature.  Rename the public-facing
parts of the avahi module to use the mdns vocabulary:

- Log strings: "avahi: ..." → "mdns: ..."
- Public API:  avahi_ctx_init/exit → mdns_ctx_init/exit
- Main type:   struct avahi_ctx → struct mdns_ctx
- statd field: statd.avahi → statd.mdns

Internal types (struct avahi_neighbor, avahi_service, …) and file names
(avahi.c, avahi.h) are kept as-is — developers debugging at the C level
benefit from knowing the underlying implementation.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Save a few CPU cycles by skipping a new dagger generation when no
interfaces have been modified/added/deleted.

Uses d->next_fp as the sentinel: NULL means no claim was made for this
transaction.  dagger_evolve() and dagger_abandon() now NULL it after
fclose, so subsequent unclaimed transactions also get the clean early
return.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Previously svc_enadis() would call 'initctl enable + touch' for every
config change event, even when the service's enabled state was unchanged.
This caused rousette to be unnecessarily restarted on every test_reset(),
racing with the active RESTCONF connection on slow hardware.

Replace svc_enadis() and svc_change() with svc_enable() which only
manages nginx symlinks and calls 'initctl enable/disable' -- never touch.
Each handler now checks the diff for the specific leaf that changed:

- If /enabled appears in diff: call svc_enable() to start or stop it
- If other config leaves changed with service already enabled: touch only

This ensures rousette, ttyd, netbrowse, avahi, sshd, and lldpd are only
restarted when their configuration actually requires it.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
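
The per-handler decision described in the commit above reduces to a small function. The sketch below is in Python and only models the decision, assuming the diff is available as a set of changed leaf names. `service_action` and its return values are hypothetical, not the confd API.

```python
def service_action(changed_leaves, enabled):
    """Decide what to do with a service after a config change.

    changed_leaves: set of leaf names that appear in the diff
    enabled:        the service's enabled state in the new config
    Returns "enable", "disable", "touch", or None (leave it alone).
    """
    if "enabled" in changed_leaves:
        return "enable" if enabled else "disable"
    if changed_leaves and enabled:
        return "touch"      # config changed, service needs a reload
    return None             # nothing relevant changed: no restart
```

The last branch is the point of the change: an event that does not touch the service's own leaves no longer restarts it.
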
Add a finit_enable/disable/reload() family in core.c that directly
manipulates Finit's service state without fork+exec overhead:

  finit_enable(svc)  -- create symlink in /etc/finit.d/enabled/
  finit_disable(svc) -- remove symlink from /etc/finit.d/enabled/
  finit_delete(svc)  -- remove both symlink and service entirely
  finit_reload(svc)  -- utimensat() on .conf to schedule reload

Printf-style variants (finit_enablef/disablef/reloadf) handle
template instance names such as container@foo and hostapd@wlan0.

All systemf("initctl ... enable/disable/touch ...") call sites across
containers, dhcp-server, firewall, hardware, ntp, routing, services,
syslog, and system are converted to the new API.

As a related cleanup in services.c, drop the remaining srx_enabled()
calls in favour of reading the already-fetched config tree directly
via lydx_is_enabled(lydx_get_xpathf(config, ...)), eliminating the
last sysrepo round-trips from that module.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
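
As a rough illustration of the symlink-and-timestamp approach in the commit above, here is a Python sketch of the enable, disable, and reload operations. It is not the C implementation in core.c. The `root` parameter, the `available/` directory, and the `.conf` suffix on link names are assumptions made so the example can run in a scratch directory.

```python
import os


def finit_enable(root, svc):
    """Create enabled/<svc>.conf as a symlink to ../available/<svc>.conf."""
    link = os.path.join(root, "enabled", f"{svc}.conf")
    try:
        os.symlink(os.path.join("..", "available", f"{svc}.conf"), link)
    except FileExistsError:
        pass                # already enabled


def finit_disable(root, svc):
    """Remove the symlink from enabled/, ignoring a missing link."""
    try:
        os.unlink(os.path.join(root, "enabled", f"{svc}.conf"))
    except FileNotFoundError:
        pass                # already disabled


def finit_reload(root, svc):
    """Bump the .conf timestamps to now, like touch (or utimensat in C)."""
    os.utime(os.path.join(root, "available", f"{svc}.conf"))
```

Each call is a single system call in the calling process, which is where the saving over spawning `initctl` comes from.
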
Replace remaining systemf() calls that invoke simple file-system
operations with direct C API equivalents, eliminating unnecessary
fork/exec overhead:

 - mkdir -p     → mkpath() from libite
 - ln -sf       → erase() + symlink() from libite/POSIX
 - rm -rf       → rmrf() from libsrx helpers
 - rm -f dir/*  → rmrf() + mkpath() to clear and recreate the dir

Files updated: dagger.c, containers.c, firewall.c, services.c, system.c

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
After a successful bootstrap confd writes a sentinel to /run/confd.boot.
If Finit restarts confd — whether after a crash or a clean exit — the
sentinel is found and the destructive bootstrap phases are skipped:

 - gen-config fork (factory/failure configs already exist)
 - wipe_sysrepo_shm() (other daemons, e.g. statd, are live)
 - sr_install_factory_config() (datastores are already initialised)
 - sr_replace_config(NULL, NULL) (running datastore is consistent)
 - bootstrap_config() / load startup-config (not needed; sysrepo has
   the right state; plugins resync via SR_EV_ENABLED on re-subscribe)

On restart confd reconnects to sysrepo, re-initialises plugins (which
re-subscribe and receive SR_EV_ENABLED to resync with the live running
datastore), then enters the steady-state event loop.

The sentinel lives on tmpfs so a real reboot always produces a clean
slate.  Crash-loop protection is delegated to Finit's max-restarts (10).

As a side-effect this also enables a future "run-once" mode for resource
constrained systems: confd can exit after bootstrap and the sentinel
ensures any later restart just re-attaches without re-bootstrapping.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
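
The sentinel logic from the commit above can be sketched in a few lines of Python. This is a simplified model, not confd itself: `start`, `bootstrap`, and `attach` are hypothetical names, with `bootstrap` standing for the destructive phases and `attach` for reconnecting and re-initialising plugins.

```python
import os


def start(sentinel, bootstrap, attach):
    """Run the bootstrap phases only when the sentinel file is absent.

    Returns "restart" if bootstrap was skipped, otherwise "boot".
    """
    if os.path.exists(sentinel):
        attach()                    # restart: re-attach to the live state
        return "restart"

    bootstrap()                     # fresh boot: destructive phases run
    attach()
    with open(sentinel, "w"):       # written only after bootstrap succeeded
        pass
    return "boot"
```

If `bootstrap()` raises, the sentinel is never written, so the next start bootstraps again. Placing the file on tmpfs, as the commit does with /run, is what makes a real reboot start from scratch.
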
The setting 'stratumweight 0.0' disables stratum in Chrony source
selection (pure distance-based), making client_stratum_selection
non-deterministic on a LAN.  Setting it to 1.0 gives srv1 a 1-second
effective advantage per stratum level, which no realistic distance
fluctuation can overcome.

Also correct the YANG descriptions in infix-system and infix-ntp which
had the semantics backwards — claiming 0.0 "ensures lower stratum is
always preferred" when in fact higher values do that.

Fixes #1361

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
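
A small worked example of why the weight matters. It assumes a simplified model in which each source is scored as its distance plus stratum times the weight and the lowest score wins. Real chronyd selection considers more than this, and the distances below are made-up LAN values.

```python
def select_source(sources, stratumweight):
    """Pick the source with the lowest distance + stratum * weight.

    sources: list of (name, stratum, distance_in_seconds) tuples.
    """
    return min(sources, key=lambda s: s[2] + s[1] * stratumweight)[0]


# Hypothetical LAN measurements: srv1 is stratum 1 but momentarily
# looks 0.3 ms "farther away" than the stratum 2 server.
SOURCES = [("srv1", 1, 0.0012), ("srv2", 2, 0.0009)]
```

With a weight of 0.0 the scores are 0.0012 and 0.0009, so srv2 wins on distance alone and the outcome depends on jitter. With 1.0 they become 1.0012 and 2.0009, so srv1 wins by about a second regardless of sub-millisecond fluctuations.
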
Fixes #1446

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Add --log-level command line option to filter out log messages from
lower log levels.  Then fix the annoying CzechLight warning message
and useless "NACM config validation" log.  Then add audit trail as
we have in netopeer2-server, and finish off by stripping redundant
fields from log message: timestamp, identity, and log level.

Fixes #892

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
test_reset() triggers a config reload which causes services such as
rousette to restart.  Wait for the transport to become reachable again
before returning from attach(), preventing subsequent API calls from
racing with a still-restarting backend.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
A service restart triggered by finit during sysrepo callbacks can drop
an in-flight HTTP connection, causing copy("candidate", "running") to
fail with RemoteDisconnected.  Add retry logic consistent with the
existing PATCH retry pattern in put_config_dict().

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
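
A minimal sketch of the kind of retry wrapper the commit above describes. It is not the test framework's code: `retry_on_disconnect` and its parameters are hypothetical, and it assumes the failure surfaces as `http.client.RemoteDisconnected` (an HTTP library wrapping it in its own exception type would need that type caught instead).

```python
import time
from http.client import RemoteDisconnected


def retry_on_disconnect(fn, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fn(), retrying if the server drops the connection.

    Re-raises the last RemoteDisconnected once attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RemoteDisconnected:
            if attempt == attempts:
                raise
            sleep(delay)    # give the restarting backend time to return
```

Retrying is only safe for idempotent requests. Copying candidate to running should qualify, since repeating it yields the same running config.
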
Replace fragile SSH+jq approach for reading the chassis MAC with a
NETCONF query to ietf-hardware:hardware/component[name='mainboard'],
reading phys-address directly from the YANG model.

Also add until() polling to the chassis MAC and chassis+offset MAC
verification steps, consistent with the reset-to-default steps.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
has_fix() only checks that fix-mode is set (2d/3d), but altitude and
other fields may not yet be populated in the operational datastore when
gpsd is still processing its first NMEA cycles after boot.  Calling
verify_position() immediately after has_fix() passes can therefore race
and fail with:

  KeyError: 'altitude'

This manifests reliably on the second GPS receiver (gps1) after reboot,
because it is initialized slightly later than gps0 and hits the window
where fix-mode is set but altitude has not yet appeared.

  not ok 11 - Verify gps1 position is near the coordinates
  # Traceback (most recent call last):
  #   File "test/case/hardware/gps_simple/test.py", line 29, in verify_position
  #     alt = float(state["altitude"])
  #                 ~~~~~^^^^^^^^^^^^
  # KeyError: 'altitude'

Add has_position() to infamy/gps.py, which gates on fix-mode AND all
position fields (latitude, longitude, altitude, satellites-used) being
present.  Replace the has_fix() polls in both the pre- and post-reboot
verify steps with has_position().

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
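
The gating condition described in the commit above can be sketched as a pure function over one receiver's operational state. This is not the actual `has_position()` in infamy/gps.py, which takes a target and a receiver name. `position_ready` is a hypothetical helper that assumes the state has already been fetched into a dict keyed by the leaf names the commit lists.

```python
REQUIRED_FIELDS = ("latitude", "longitude", "altitude", "satellites-used")


def position_ready(state):
    """True once fix-mode is 2d/3d AND every position field is present."""
    if state.get("fix-mode") not in ("2d", "3d"):
        return False
    return all(field in state for field in REQUIRED_FIELDS)
```

Polling on this condition closes the window where fix-mode is set but altitude has not appeared, so a later `float(state["altitude"])` cannot raise KeyError.
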
statd now uses avahi-client for mDNS neighbor tracking, so the
native build needs libavahi-client-dev to satisfy pkg-config.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Also, update testing-overview.svg to support dark mode view.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
Signed-off-by: Joachim Wiberg <troglobit@gmail.com>

@mattiaswal mattiaswal left a comment

🦸

@troglobit troglobit merged commit 15d6947 into main Mar 20, 2026
8 of 9 checks passed
@troglobit troglobit deleted the initviz branch March 20, 2026 18:17
Labels

ci:main Build default defconfig, not minimal

3 participants