Skip to content

support saving bin logs on server#36

Open
tridge wants to merge 13 commits intoArduPilot:mainfrom
tridge:pr-bin-logs
Open

support saving bin logs on server#36
tridge wants to merge 13 commits intoArduPilot:mainfrom
tridge:pr-bin-logs

Conversation

@tridge
Copy link
Copy Markdown
Contributor

@tridge tridge commented May 11, 2026

still working on the details, but basic support works
image

tridge and others added 13 commits May 11, 2026 06:35
Reserves bit 3 of the existing flags word for the upcoming ArduPilot
binary-log (.bin) recorder. No struct change — bin files reuse
KeyEntry.tlog_retention_days (per spec, same retention rules as
.tlog). keydb.py setflag/clearflag binlog work for free via
FLAG_NAMES, and KeyEntry.__str__ already enumerates flag names so
'keydb.py list' shows 'binlog' alongside 'tlog'.

Tests cover the bit's round-trip through pack/unpack, coexistence
with FLAG_TLOG on one entry, and the CLI smoke path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move TlogWriter's per-day sessionN-scan out of tlog.cpp into a small
shared helper next_session_n() in new session.h / session.cpp. The
helper scans for BOTH session*.tlog AND session*.bin, so when the
upcoming binlog feature lands, a single fork's .tlog and .bin files
share the same N: per-child code computes the number once at fork
start and passes it to TlogWriter::open() (now takes an explicit
session_n; the internal scan is gone). The dir-creation helper
mkpath() is also lifted into session.cpp so binlog can reuse it.

No behaviour change for existing tlog deployments. Existing
session1.tlog / session2.tlog naming is preserved; the new scan
just happens to also notice .bin files (none of which exist yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When KEY_FLAG_BINLOG is set on an entry, the per-port-pair child
runs an ArduPilot "dataflash over MAVLink" sink for the user-side
link:

  * On the first REMOTE_LOG_DATA_BLOCK (msgid 184) from the user,
    open logs/<port2>/<YYYY-MM-DD>/sessionN.bin (lazy — an idle
    child never littered). N is the shared session_n from
    next_session_n() so .bin pairs with .tlog from the same fork.
  * Each block's 200 payload bytes are sparse-written at offset
    seqno * 200. Out-of-order blocks fill holes in-place; the file
    is always the size of the highest seen seqno + 1 blocks.
  * Each block gets a REMOTE_LOG_BLOCK_STATUS=ACK back through the
    user-side link. Forward jumps (seqno > highest_seen + 1) queue
    NACKs for the missing seqnos; tick() drains them, throttled to
    one NACK per missing seqno per 0.1s. We abandon a missing block
    after 60s wall time or once highest_seen is 200 ahead — both
    match MAVProxy's mavproxy_dataflash_logger.py.
  * Both REMOTE_LOG_* msgids (184, 185) are stripped from the
    user→engineer forward path so the support engineer's live
    session isn't polluted by log traffic. Engineer→user direction
    is left alone (per spec).

A few supporting pieces:

  * MAVLink::send_buf() — new thin public wrapper around the
    private send_data() that lets binlog (and any future caller)
    push *already-finalised* bytes through this link's transport.
    Needed because pack_chan() finalises the message (trims a
    trailing zero status byte for NACKs, writes CRC bytes into the
    payload area at the trimmed offset); calling send_message()
    afterwards would re-finalise, re-trim, see the CRC byte as
    "live payload", and emit the CRC byte on the wire where the
    receiver expects status — silent payload corruption on every
    NACK. The new send_buf() skips that second finalize.
  * Parse-guard relaxation in main_loop: the user-side
    receive_message() was gated on conn2_count > 0; binlog also
    needs to parse user packets, so the guard becomes
    `conn2_count > 0 || binlog_enabled`. Tlog behaviour is
    unchanged (still requires an engineer; pre-existing
    limitation, not regressed).

Tests in tests/test_binlog_capture.py drive a real proxy with a
synthetic vehicle (raw socket emitting hand-built DATA_BLOCKs):
  - blocks land at the right offsets, ACKs come back
  - REMOTE_LOG_* not seen on engineer side
  - forced gap triggers a NACK for the missing seqno
  - sessionN.tlog and sessionN.bin share N when both flags set

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup loop's per-file predicate (cleanup.cpp:26) becomes
is_session_file() and accepts both '.tlog' and '.bin'. The
retention rule (tlog_retention_days * 86400 by mtime) and the
empty-date-dir prune both apply uniformly — per spec, one
retention value covers both file types per entry.

Webadmin's session-file regex (webadmin/tlogs.py:24) broadens to
session\d+\.(tlog|bin). The listing pages and the
send_from_directory download endpoint flow through this regex
unchanged, so .bin files surface in the same per-date table and
download through the same path-traversal-safe route. The
private/no-store Cache-Control header from commit fc24fac
applies to .bin too (same sensitivity as .tlog).

Tests in tests/webadmin/test_tlog_routes.py:
  * owner + admin listings show .bin alongside .tlog
  * owner downloads a .bin successfully
  * .bin download carries Cache-Control: private, no-store
  * a session1.log / session1.pem in the same dir is still 404
    (regex caps the extension)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FireVPS was logging repeated [CRITICAL] WORKER TIMEOUT followed by
[ERROR] "Perhaps out of memory?" (gunicorn's stock SIGKILL line —
not an actual OOM). Every traceback ended in _sslobj.read() on a
TLS-wrapped socket; the public 8080 attracts a steady drip of
scanners that complete a TLS handshake then disappear, and the
default sync worker has no way to abandon a stalled connection —
the arbiter eventually kills it for missing the 30s heartbeat.

Switch to `-k gthread --threads 4 --timeout 60 --graceful-timeout 30`:
the arbiter heartbeat now ticks from the worker's main thread
regardless of what any individual request thread is blocked on, so
a scanner can hold one thread without taking the whole worker
down. -w 1 is still enough for the UI's load; threading is what
unlocks concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KEY_FLAG_BINLOG was reachable only via 'keydb.py setflag <port2>
binlog' — the original plan said the CLI was enough, but in
practice owners and admins need a web UI affordance to turn .bin
recording on/off without shelling into the server.

Adds binlog_enabled (BooleanField) to OwnerEditForm and
AdminEditForm with help text noting the firmware-side
LOG_BACKEND_TYPE mavlink bit prerequisite. Routes set/clear
FLAG_BINLOG on save (mirroring the existing FLAG_TLOG handling)
and the first-enable default for retention now fires when either
flag transitions from off to on. Templates render the new
checkbox next to the existing tlog one in both owner.html and
admin_edit.html.

The retention label is updated to "Tlog + bin retention (days, …)"
since the same tlog_retention_days field governs cleanup for both
file types per spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A plain lexical sort puts session10.tlog between session1.tlog and
session2.tlog, which is exactly when you most want to scan a
listing visually (~10+ sessions in a day means something is
churning). Split each filename into digit-runs + lowercased non-
digit runs and use that as the key, so the rendered order is
session1, session2, …, session9, session10, session11, session20.

Applied to both _list_sessions (the per-date file list — primary
target) and _list_dates (YYYY-MM-DD already sorts correctly under
either scheme, but using the same helper keeps the call sites
symmetric).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production found a corrupt session22.bin on FireVPS: 611 MB apparent
file size, only ~12 MB on disk, file starts with all zeros (an
ArduPilot .bin must begin with FMT records), and mavlogdump.py
refused to parse it. Root cause: a vehicle that was already logging
over MAVLink when SupportProxy activated has a seqno counter
already in the millions. We were obediently opening the file on
the *first* DATA_BLOCK regardless of seqno and sparse-writing at
offset seqno*200, so the resulting file is mostly holes with a
few clusters of real data — and crucially has no FMT records at
byte 0 so DFReader can't anchor.

Gate file open on seqno == 0 (matches MAVProxy's
mavproxy_dataflash_logger.py:344-348). Any DATA_BLOCK arriving
before that is silently discarded — both the file open and the
ACK side-effect. The vehicle restarting its log stream gives us a
fresh seqno=0; until then we wait.

This is intentionally strict for now; we may relax later (e.g. by
sending REMOTE_LOG_BLOCK_STATUS=START to nudge the vehicle into
restarting, or by accepting mid-stream seqnos when the user
explicitly opts in) but the strict gate is the simplest
correctness anchor.

Tests in tests/test_binlog_capture.py:

* test_strict_start_gate_discards_pre_seqno_0 — drives DATA_BLOCKs
  at seqno=1000..1003 through a fresh binlog-enabled child, asserts
  no .bin file is created and no ACKs are returned. Then sends
  seqno=0 + seqno=1 and asserts the file opens and is exactly 400
  bytes (no sparse extension to whatever the earlier seqnos were).

* test_bin_parses_with_dfreader — packs a minimal-but-valid
  ArduPilot bin payload (one FMT-of-FMT record padded to 200 bytes)
  into a DATA_BLOCK with seqno=0, drives it through the proxy, then
  opens the resulting sessionN.bin with
  pymavlink.DFReader.DFReader_binary and asserts the FMT record
  parses out. This is the test that would have caught the FireVPS
  corruption — without it we had no end-to-end "the file is
  actually a valid bin log" check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ArduPilot pre-arm rejected the vehicle on FireVPS with
"logging failed" whenever LOG_BACKEND_TYPE included the mavlink
backend (=3). Root cause: AP_Logger_MAVLink::logging_failed()
returns !_sending_to_client, and that flag only flips to true
when the vehicle receives a REMOTE_LOG_BLOCK_STATUS with status=ACK
and seqno=MAV_REMOTE_LOG_DATA_BLOCK_START (2147483646)
— see AP_Logger_MAVLink.cpp:240-251. We weren't sending it, so
the vehicle thought no GCS-side logger was listening and refused
to arm. (As a side-effect, the same START also resets the
vehicle's seqno counter to 0, which our strict file-open gate
already requires.)

BinlogWriter::tick() gains a "before any DATA_BLOCK" phase: while
binlog is enabled and any_block_seen is false, emit
REMOTE_LOG_BLOCK_STATUS(seqno=START_MAGIC, status=ACK) at 1 Hz,
throttled by last_start_sent_s. After the first DATA_BLOCK arrives
we fall through to the existing ACK/NACK pump; the continuous ACK
traffic from real blocks resets ArduPilot's 10-second
_last_response_time client-timeout for us.

target_system / target_component on the outgoing START are 0
(broadcast) — we haven't latched a vehicle sysid yet, and
ArduPilot's handle_ack() picks the proxy's identity off the
message header's src fields, not the body's target fields.

The tick site in supportproxy.cpp now fires on
`binlog_enabled && have_conn1` instead of `binlog.is_open()`,
since the START phase predates the file open.

Two tests in tests/test_binlog_capture.py:

* test_proxy_sends_remote_log_start_when_idle drives HEARTBEATs at
  ~1 Hz (the way a real ArduPilot does — main_loop's select wakes
  on each one, giving tick a chance to fire) and asserts ≥ 2 STARTs
  arrive on the user-side socket in 3 s.

* test_proxy_stops_sending_start_after_first_data_block sends
  seqno=0,1 right away, drains the ACKs, then watches for 2 s and
  asserts no further STARTs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MAX_ACKS_PER_TICK = 10 was carried over from MAVProxy's 100 Hz
idle-loop pump, where 10/loop already delivers 1 kHz of throughput.
Our main_loop wakes on each incoming packet rather than on a fixed
schedule, and under a TCP burst one recv() can return ~45 MAVLink
frames in a single call — the cap would let the ACK queue age 4+
ticks before catching up, prolonging the vehicle's
"pending-block" pressure exactly when the vehicle is logging
hardest.

Drop the cap; drain the whole pending_acks queue per tick. Each
ACK is a small UDP send; the cost difference is irrelevant next to
the benefit of freeing the vehicle's queue faster.

Field motivation: FireVPS DMS messages showed ~1 % message drop
during ArduPilot pre-arm bursts (~0.25 % in level flight). The
dominant cause is vehicle-side semaphore contention in
_WritePrioritisedBlock — we can't eliminate that from the proxy,
but the uncapped ACK pump is the one knob on our side that
strictly helps.

NACKs keep their MAX_NACKS_PER_TICK throttle so a wide gap can't
flood the wire with status messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The retention field is shared by .tlog and .bin recordings (the cleanup
loop already ages both), so the tlog-specific name was misleading. Rename
the C struct field, the Python helper (set_tlog_retention -> set_log_retention),
the default constant, and the __str__ output. The on-disk struct layout
is unchanged: same 4-byte float at the same offset, so existing keys.tdb
entries decode without migration.

The FLAG_TLOG bit, the 'tlog' flag name, and tlog-specific tests keep
their names — only the shared retention field is renamed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly cleanup worker now ages out both .tlog and .bin recordings
under the shared retention field, so the tlog-specific naming on the
loop entry point and the printf strings was misleading. Rename
tlog_cleanup_loop / tlog_cleanup_once and the surrounding log lines;
the file predicate and per-entry behaviour are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The listing and download blueprint now serves both .tlog and .bin
session files, and the retention form field is shared, so the
tlog-specific URLs, blueprint names, template names, and form labels
were misleading. Rename:

  webadmin/tlogs.py        -> webadmin/logs.py
  /admin/tlogs/<port2>/... -> /admin/logs/<port2>/...
  /me/tlogs/...            -> /me/logs/...
  admin_tlogs.html         -> admin_logs.html
  owner_tlogs.html         -> owner_logs.html
  form.tlog_retention_days -> form.log_retention_days
  blueprint admin_tlogs    -> admin_logs
  blueprint owner_tlogs    -> owner_logs

The tlog_enabled / binlog_enabled checkboxes keep their names since
they map 1:1 to the per-flag bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant