support saving bin logs on server by tridge · Pull Request #36 · ArduPilot/SupportProxy

tridge · 2026-05-11T00:50:59Z

still working on the details, but basic support works

Reserves bit 3 of the existing flags word for the upcoming ArduPilot binary-log (.bin) recorder. No struct change — bin files reuse KeyEntry.tlog_retention_days (per spec, same retention rules as .tlog). keydb.py setflag/clearflag binlog work for free via FLAG_NAMES, and KeyEntry.__str__ already enumerates flag names so 'keydb.py list' shows 'binlog' alongside 'tlog'. Tests cover the bit's round-trip through pack/unpack, coexistence with FLAG_TLOG on one entry, and the CLI smoke path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move TlogWriter's per-day sessionN-scan out of tlog.cpp into a small shared helper next_session_n() in new session.h / session.cpp. The helper scans for BOTH session*.tlog AND session*.bin, so when the upcoming binlog feature lands, a single fork's .tlog and .bin files share the same N: per-child code computes the number once at fork start and passes it to TlogWriter::open() (now takes an explicit session_n; the internal scan is gone). The dir-creation helper mkpath() is also lifted into session.cpp so binlog can reuse it. No behaviour change for existing tlog deployments. Existing session1.tlog / session2.tlog naming is preserved; the new scan just happens to also notice .bin files (none of which exist yet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When KEY_FLAG_BINLOG is set on an entry, the per-port-pair child runs an ArduPilot "dataflash over MAVLink" sink for the user-side link: * On the first REMOTE_LOG_DATA_BLOCK (msgid 184) from the user, open logs/<port2>/<YYYY-MM-DD>/sessionN.bin (lazy — an idle child never littered). N is the shared session_n from next_session_n() so .bin pairs with .tlog from the same fork. * Each block's 200 payload bytes are sparse-written at offset seqno * 200. Out-of-order blocks fill holes in-place; the file is always the size of the highest seen seqno + 1 blocks. * Each block gets a REMOTE_LOG_BLOCK_STATUS=ACK back through the user-side link. Forward jumps (seqno > highest_seen + 1) queue NACKs for the missing seqnos; tick() drains them, throttled to one NACK per missing seqno per 0.1s. We abandon a missing block after 60s wall time or once highest_seen is 200 ahead — both match MAVProxy's mavproxy_dataflash_logger.py. * Both REMOTE_LOG_* msgids (184, 185) are stripped from the user→engineer forward path so the support engineer's live session isn't polluted by log traffic. Engineer→user direction is left alone (per spec). A few supporting pieces: * MAVLink::send_buf() — new thin public wrapper around the private send_data() that lets binlog (and any future caller) push *already-finalised* bytes through this link's transport. Needed because pack_chan() finalises the message (trims a trailing zero status byte for NACKs, writes CRC bytes into the payload area at the trimmed offset); calling send_message() afterwards would re-finalise, re-trim, see the CRC byte as "live payload", and emit the CRC byte on the wire where the receiver expects status — silent payload corruption on every NACK. The new send_buf() skips that second finalize. * Parse-guard relaxation in main_loop: the user-side receive_message() was gated on conn2_count > 0; binlog also needs to parse user packets, so the guard becomes `conn2_count > 0 || binlog_enabled`. Tlog behaviour is unchanged (still requires an engineer; pre-existing limitation, not regressed). Tests in tests/test_binlog_capture.py drive a real proxy with a synthetic vehicle (raw socket emitting hand-built DATA_BLOCKs): - blocks land at the right offsets, ACKs come back - REMOTE_LOG_* not seen on engineer side - forced gap triggers a NACK for the missing seqno - sessionN.tlog and sessionN.bin share N when both flags set Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cleanup loop's per-file predicate (cleanup.cpp:26) becomes is_session_file() and accepts both '.tlog' and '.bin'. The retention rule (tlog_retention_days * 86400 by mtime) and the empty-date-dir prune both apply uniformly — per spec, one retention value covers both file types per entry. Webadmin's session-file regex (webadmin/tlogs.py:24) broadens to session\d+\.(tlog|bin). The listing pages and the send_from_directory download endpoint flow through this regex unchanged, so .bin files surface in the same per-date table and download through the same path-traversal-safe route. The private/no-store Cache-Control header from commit fc24fac applies to .bin too (same sensitivity as .tlog). Tests in tests/webadmin/test_tlog_routes.py: * owner + admin listings show .bin alongside .tlog * owner downloads a .bin successfully * .bin download carries Cache-Control: private, no-store * a session1.log / session1.pem in the same dir is still 404 (regex caps the extension) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FireVPS was logging repeated [CRITICAL] WORKER TIMEOUT followed by [ERROR] "Perhaps out of memory?" (gunicorn's stock SIGKILL line — not an actual OOM). Every traceback ended in _sslobj.read() on a TLS-wrapped socket; the public 8080 attracts a steady drip of scanners that complete a TLS handshake then disappear, and the default sync worker has no way to abandon a stalled connection — the arbiter eventually kills it for missing the 30s heartbeat. Switch to `-k gthread --threads 4 --timeout 60 --graceful-timeout 30`: the arbiter heartbeat now ticks from the worker's main thread regardless of what any individual request thread is blocked on, so a scanner can hold one thread without taking the whole worker down. -w 1 is still enough for the UI's load; threading is what unlocks concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

KEY_FLAG_BINLOG was reachable only via 'keydb.py setflag <port2> binlog' — the original plan said the CLI was enough, but in practice owners and admins need a web UI affordance to turn .bin recording on/off without shelling into the server. Adds binlog_enabled (BooleanField) to OwnerEditForm and AdminEditForm with help text noting the firmware-side LOG_BACKEND_TYPE mavlink bit prerequisite. Routes set/clear FLAG_BINLOG on save (mirroring the existing FLAG_TLOG handling) and the first-enable default for retention now fires when either flag transitions from off to on. Templates render the new checkbox next to the existing tlog one in both owner.html and admin_edit.html. The retention label is updated to "Tlog + bin retention (days, …)" since the same tlog_retention_days field governs cleanup for both file types per spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A plain lexical sort puts session10.tlog between session1.tlog and session2.tlog, which is exactly when you most want to scan a listing visually (~10+ sessions in a day means something is churning). Split each filename into digit-runs + lowercased non- digit runs and use that as the key, so the rendered order is session1, session2, …, session9, session10, session11, session20. Applied to both _list_sessions (the per-date file list — primary target) and _list_dates (YYYY-MM-DD already sorts correctly under either scheme, but using the same helper keeps the call sites symmetric). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Production found a corrupt session22.bin on FireVPS: 611 MB apparent file size, only ~12 MB on disk, file starts with all zeros (an ArduPilot .bin must begin with FMT records), and mavlogdump.py refused to parse it. Root cause: a vehicle that was already logging over MAVLink when SupportProxy activated has a seqno counter already in the millions. We were obediently opening the file on the *first* DATA_BLOCK regardless of seqno and sparse-writing at offset seqno*200, so the resulting file is mostly holes with a few clusters of real data — and crucially has no FMT records at byte 0 so DFReader can't anchor. Gate file open on seqno == 0 (matches MAVProxy's mavproxy_dataflash_logger.py:344-348). Any DATA_BLOCK arriving before that is silently discarded — both the file open and the ACK side-effect. The vehicle restarting its log stream gives us a fresh seqno=0; until then we wait. This is intentionally strict for now; we may relax later (e.g. by sending REMOTE_LOG_BLOCK_STATUS=START to nudge the vehicle into restarting, or by accepting mid-stream seqnos when the user explicitly opts in) but the strict gate is the simplest correctness anchor. Tests in tests/test_binlog_capture.py: * test_strict_start_gate_discards_pre_seqno_0 — drives DATA_BLOCKs at seqno=1000..1003 through a fresh binlog-enabled child, asserts no .bin file is created and no ACKs are returned. Then sends seqno=0 + seqno=1 and asserts the file opens and is exactly 400 bytes (no sparse extension to whatever the earlier seqnos were). * test_bin_parses_with_dfreader — packs a minimal-but-valid ArduPilot bin payload (one FMT-of-FMT record padded to 200 bytes) into a DATA_BLOCK with seqno=0, drives it through the proxy, then opens the resulting sessionN.bin with pymavlink.DFReader.DFReader_binary and asserts the FMT record parses out. This is the test that would have caught the FireVPS corruption — without it we had no end-to-end "the file is actually a valid bin log" check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ArduPilot pre-arm rejected the vehicle on FireVPS with "logging failed" whenever LOG_BACKEND_TYPE included the mavlink backend (=3). Root cause: AP_Logger_MAVLink::logging_failed() returns !_sending_to_client, and that flag only flips to true when the vehicle receives a REMOTE_LOG_BLOCK_STATUS with status=ACK and seqno=MAV_REMOTE_LOG_DATA_BLOCK_START (2147483646) — see AP_Logger_MAVLink.cpp:240-251. We weren't sending it, so the vehicle thought no GCS-side logger was listening and refused to arm. (As a side-effect, the same START also resets the vehicle's seqno counter to 0, which our strict file-open gate already requires.) BinlogWriter::tick() gains a "before any DATA_BLOCK" phase: while binlog is enabled and any_block_seen is false, emit REMOTE_LOG_BLOCK_STATUS(seqno=START_MAGIC, status=ACK) at 1 Hz, throttled by last_start_sent_s. After the first DATA_BLOCK arrives we fall through to the existing ACK/NACK pump; the continuous ACK traffic from real blocks resets ArduPilot's 10-second _last_response_time client-timeout for us. target_system / target_component on the outgoing START are 0 (broadcast) — we haven't latched a vehicle sysid yet, and ArduPilot's handle_ack() picks the proxy's identity off the message header's src fields, not the body's target fields. The tick site in supportproxy.cpp now fires on `binlog_enabled && have_conn1` instead of `binlog.is_open()`, since the START phase predates the file open. Two tests in tests/test_binlog_capture.py: * test_proxy_sends_remote_log_start_when_idle drives HEARTBEATs at ~1 Hz (the way a real ArduPilot does — main_loop's select wakes on each one, giving tick a chance to fire) and asserts ≥ 2 STARTs arrive on the user-side socket in 3 s. * test_proxy_stops_sending_start_after_first_data_block sends seqno=0,1 right away, drains the ACKs, then watches for 2 s and asserts no further STARTs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MAX_ACKS_PER_TICK = 10 was carried over from MAVProxy's 100 Hz idle-loop pump, where 10/loop already delivers 1 kHz of throughput. Our main_loop wakes on each incoming packet rather than on a fixed schedule, and under a TCP burst one recv() can return ~45 MAVLink frames in a single call — the cap would let the ACK queue age 4+ ticks before catching up, prolonging the vehicle's "pending-block" pressure exactly when the vehicle is logging hardest. Drop the cap; drain the whole pending_acks queue per tick. Each ACK is a small UDP send; the cost difference is irrelevant next to the benefit of freeing the vehicle's queue faster. Field motivation: FireVPS DMS messages showed ~1 % message drop during ArduPilot pre-arm bursts (~0.25 % in level flight). The dominant cause is vehicle-side semaphore contention in _WritePrioritisedBlock — we can't eliminate that from the proxy, but the uncapped ACK pump is the one knob on our side that strictly helps. NACKs keep their MAX_NACKS_PER_TICK throttle so a wide gap can't flood the wire with status messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The retention field is shared by .tlog and .bin recordings (the cleanup loop already ages both), so the tlog-specific name was misleading. Rename the C struct field, the Python helper (set_tlog_retention -> set_log_retention), the default constant, and the __str__ output. The on-disk struct layout is unchanged: same 4-byte float at the same offset, so existing keys.tdb entries decode without migration. The FLAG_TLOG bit, the 'tlog' flag name, and tlog-specific tests keep their names — only the shared retention field is renamed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The hourly cleanup worker now ages out both .tlog and .bin recordings under the shared retention field, so the tlog-specific naming on the loop entry point and the printf strings was misleading. Rename tlog_cleanup_loop / tlog_cleanup_once and the surrounding log lines; the file predicate and per-entry behaviour are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The listing and download blueprint now serves both .tlog and .bin session files, and the retention form field is shared, so the tlog-specific URLs, blueprint names, template names, and form labels were misleading. Rename: webadmin/tlogs.py -> webadmin/logs.py /admin/tlogs/<port2>/... -> /admin/logs/<port2>/... /me/tlogs/... -> /me/logs/... admin_tlogs.html -> admin_logs.html owner_tlogs.html -> owner_logs.html form.tlog_retention_days -> form.log_retention_days blueprint admin_tlogs -> admin_logs blueprint owner_tlogs -> owner_logs The tlog_enabled / binlog_enabled checkboxes keep their names since they map 1:1 to the per-flag bits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tridge and others added 13 commits May 11, 2026 06:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support saving bin logs on server#36

support saving bin logs on server#36
tridge wants to merge 13 commits intoArduPilot:mainfrom
tridge:pr-bin-logs

tridge commented May 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tridge commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tridge commented May 11, 2026 •

edited

Loading