support saving bin logs on server #36
Open
tridge wants to merge 13 commits into ArduPilot:main from
Conversation
Reserves bit 3 of the existing flags word for the upcoming ArduPilot binary-log (.bin) recorder. No struct change: bin files reuse KeyEntry.tlog_retention_days (per spec, the same retention rules as .tlog).
keydb.py's setflag/clearflag binlog commands work for free via FLAG_NAMES, and KeyEntry.__str__ already enumerates flag names, so 'keydb.py list' shows 'binlog' alongside 'tlog'.
Tests cover the bit's round-trip through pack/unpack, coexistence with FLAG_TLOG on one entry, and the CLI smoke path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
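A minimal Python sketch of the flag plumbing this commit describes. The bit values and the 32-bit little-endian flags layout are assumptions (the commit only says binlog takes bit 3); the real definitions live in keydb.py:

```python
import struct

FLAG_TLOG = 1 << 0      # assumed existing tlog bit
FLAG_BINLOG = 1 << 3    # bit 3, per the commit message

# name -> bit table; setflag/clearflag and __str__ both key off this
FLAG_NAMES = {"tlog": FLAG_TLOG, "binlog": FLAG_BINLOG}

def pack_flags(flags):
    # 32-bit little-endian flags word (layout assumed)
    return struct.pack("<I", flags)

def unpack_flags(data):
    return struct.unpack("<I", data)[0]

def flag_names(flags):
    # mirrors KeyEntry.__str__ enumerating set bits by name
    return [name for name, bit in FLAG_NAMES.items() if flags & bit]
```

Because the new bit only extends the name table, both flags can coexist on one entry and round-trip unchanged through pack/unpack.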
Move TlogWriter's per-day sessionN scan out of tlog.cpp into a small shared helper, next_session_n(), in new session.h / session.cpp. The helper scans for BOTH session*.tlog AND session*.bin, so when the upcoming binlog feature lands, a single fork's .tlog and .bin files share the same N: per-child code computes the number once at fork start and passes it to TlogWriter::open() (which now takes an explicit session_n; the internal scan is gone). The dir-creation helper mkpath() is also lifted into session.cpp so binlog can reuse it.
No behaviour change for existing tlog deployments. Existing session1.tlog / session2.tlog naming is preserved; the new scan just happens to also notice .bin files (none of which exist yet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
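The helper itself is C++, but the scan is simple enough to sketch in Python under the stated behaviour (both extensions counted, next free N returned, missing dir treated as empty):

```python
import os
import re

def next_session_n(dirpath):
    """Return the next free session number for this date dir, scanning
    BOTH session*.tlog and session*.bin so one fork's .tlog and .bin
    share the same N. (Python sketch of the C++ helper; names assumed.)"""
    pat = re.compile(r"^session(\d+)\.(tlog|bin)$")
    highest = 0
    try:
        entries = os.listdir(dirpath)
    except FileNotFoundError:
        entries = []  # dir not created yet: first session of the day
    for name in entries:
        m = pat.match(name)
        if m:
            highest = max(highest, int(m.group(1)))
    return highest + 1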
When KEY_FLAG_BINLOG is set on an entry, the per-port-pair child
runs an ArduPilot "dataflash over MAVLink" sink for the user-side
link:
* On the first REMOTE_LOG_DATA_BLOCK (msgid 184) from the user,
open logs/<port2>/<YYYY-MM-DD>/sessionN.bin (lazy: an idle
child never litters the filesystem). N is the shared session_n from
next_session_n() so .bin pairs with .tlog from the same fork.
* Each block's 200 payload bytes are sparse-written at offset
seqno * 200. Out-of-order blocks fill holes in-place; the file
is always the size of the highest seen seqno + 1 blocks.
* Each block gets a REMOTE_LOG_BLOCK_STATUS=ACK back through the
user-side link. Forward jumps (seqno > highest_seen + 1) queue
NACKs for the missing seqnos; tick() drains them, throttled to
one NACK per missing seqno per 0.1s. We abandon a missing block
after 60s wall time or once highest_seen is 200 ahead — both
match MAVProxy's mavproxy_dataflash_logger.py.
* Both REMOTE_LOG_* msgids (184, 185) are stripped from the
user→engineer forward path so the support engineer's live
session isn't polluted by log traffic. Engineer→user direction
is left alone (per spec).
A few supporting pieces:
* MAVLink::send_buf() — new thin public wrapper around the
private send_data() that lets binlog (and any future caller)
push *already-finalised* bytes through this link's transport.
Needed because pack_chan() finalises the message (trims a
trailing zero status byte for NACKs, writes CRC bytes into the
payload area at the trimmed offset); calling send_message()
afterwards would re-finalise, re-trim, see the CRC byte as
"live payload", and emit the CRC byte on the wire where the
receiver expects status — silent payload corruption on every
NACK. The new send_buf() skips that second finalize.
* Parse-guard relaxation in main_loop: the user-side
receive_message() was gated on conn2_count > 0; binlog also
needs to parse user packets, so the guard becomes
`conn2_count > 0 || binlog_enabled`. Tlog behaviour is
unchanged (still requires an engineer; pre-existing
limitation, not regressed).
Tests in tests/test_binlog_capture.py drive a real proxy with a
synthetic vehicle (raw socket emitting hand-built DATA_BLOCKs):
- blocks land at the right offsets, ACKs come back
- REMOTE_LOG_* not seen on engineer side
- forced gap triggers a NACK for the missing seqno
- sessionN.tlog and sessionN.bin share N when both flags set
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup loop's per-file predicate (cleanup.cpp:26) becomes is_session_file() and accepts both '.tlog' and '.bin'. The retention rule (tlog_retention_days * 86400 by mtime) and the empty-date-dir prune both apply uniformly: per spec, one retention value covers both file types per entry.
Webadmin's session-file regex (webadmin/tlogs.py:24) broadens to session\d+\.(tlog|bin). The listing pages and the send_from_directory download endpoint flow through this regex unchanged, so .bin files surface in the same per-date table and download through the same path-traversal-safe route. The private/no-store Cache-Control header from commit fc24fac applies to .bin too (same sensitivity as .tlog).
Tests in tests/webadmin/test_tlog_routes.py:
* owner + admin listings show .bin alongside .tlog
* owner downloads a .bin successfully
* .bin download carries Cache-Control: private, no-store
* a session1.log / session1.pem in the same dir is still 404 (regex caps the extension)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
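The broadened predicate is the same on both sides; a Python sketch (the anchored regex is taken from the commit, the function name mirrors the C++ side):

```python
import re

# anchored: extension is capped, so session1.log / session1.pem stay 404
SESSION_RE = re.compile(r"^session\d+\.(tlog|bin)$")

def is_session_file(name):
    return bool(SESSION_RE.match(name))
```

Anchoring at both ends is what makes this safe to reuse for the download route: only exact sessionN.tlog / sessionN.bin names pass.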
FireVPS was logging repeated [CRITICAL] WORKER TIMEOUT followed by [ERROR] "Perhaps out of memory?" (gunicorn's stock SIGKILL line, not an actual OOM). Every traceback ended in _sslobj.read() on a TLS-wrapped socket; the public 8080 attracts a steady drip of scanners that complete a TLS handshake then disappear, and the default sync worker has no way to abandon a stalled connection, so the arbiter eventually kills it for missing the 30s heartbeat.
Switch to `-k gthread --threads 4 --timeout 60 --graceful-timeout 30`: the arbiter heartbeat now ticks from the worker's main thread regardless of what any individual request thread is blocked on, so a scanner can hold one thread without taking the whole worker down. -w 1 is still enough for the UI's load; threading is what unlocks concurrency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KEY_FLAG_BINLOG was reachable only via 'keydb.py setflag <port2> binlog'. The original plan said the CLI was enough, but in practice owners and admins need a web UI affordance to turn .bin recording on/off without shelling into the server.
Adds binlog_enabled (BooleanField) to OwnerEditForm and AdminEditForm with help text noting the firmware-side LOG_BACKEND_TYPE mavlink bit prerequisite. Routes set/clear FLAG_BINLOG on save (mirroring the existing FLAG_TLOG handling), and the first-enable default for retention now fires when either flag transitions from off to on. Templates render the new checkbox next to the existing tlog one in both owner.html and admin_edit.html. The retention label is updated to "Tlog + bin retention (days, …)" since the same tlog_retention_days field governs cleanup for both file types per spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A plain lexical sort puts session10.tlog between session1.tlog and session2.tlog, which is exactly when you most want to scan a listing visually (~10+ sessions in a day means something is churning). Split each filename into digit runs + lowercased non-digit runs and use that as the sort key, so the rendered order is session1, session2, …, session9, session10, session11, session20.
Applied to both _list_sessions (the per-date file list, the primary target) and _list_dates (YYYY-MM-DD already sorts correctly under either scheme, but using the same helper keeps the call sites symmetric).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
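The digit-run split described above is the standard natural-sort key; a sketch (helper name assumed):

```python
import re

def natural_key(name):
    """Split into digit runs (compared as ints) and lowercased non-digit
    runs, so session10 sorts after session9 rather than after session1."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

files = ["session10.tlog", "session2.tlog", "session1.tlog", "session20.bin"]
ordered = sorted(files, key=natural_key)
# -> ['session1.tlog', 'session2.tlog', 'session10.tlog', 'session20.bin']
```

The capture group in re.split keeps the digit runs in the output list, so every filename decomposes into an alternating (str, int, str, …) key that compares positionally.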
Production found a corrupt session22.bin on FireVPS: 611 MB apparent file size, only ~12 MB on disk, the file starts with all zeros (an ArduPilot .bin must begin with FMT records), and mavlogdump.py refused to parse it.
Root cause: a vehicle that was already logging over MAVLink when SupportProxy activated has a seqno counter already in the millions. We were obediently opening the file on the *first* DATA_BLOCK regardless of seqno and sparse-writing at offset seqno*200, so the resulting file is mostly holes with a few clusters of real data, and crucially has no FMT records at byte 0, so DFReader can't anchor.
Gate file open on seqno == 0 (matches MAVProxy's mavproxy_dataflash_logger.py:344-348). Any DATA_BLOCK arriving before that is silently discarded: both the file open and the ACK side-effect. The vehicle restarting its log stream gives us a fresh seqno=0; until then we wait. This is intentionally strict for now; we may relax later (e.g. by sending REMOTE_LOG_BLOCK_STATUS=START to nudge the vehicle into restarting, or by accepting mid-stream seqnos when the user explicitly opts in), but the strict gate is the simplest correctness anchor.
Tests in tests/test_binlog_capture.py:
* test_strict_start_gate_discards_pre_seqno_0 drives DATA_BLOCKs at seqno=1000..1003 through a fresh binlog-enabled child, asserts no .bin file is created and no ACKs are returned, then sends seqno=0 + seqno=1 and asserts the file opens and is exactly 400 bytes (no sparse extension to whatever the earlier seqnos were).
* test_bin_parses_with_dfreader packs a minimal-but-valid ArduPilot bin payload (one FMT-of-FMT record padded to 200 bytes) into a DATA_BLOCK with seqno=0, drives it through the proxy, then opens the resulting sessionN.bin with pymavlink.DFReader.DFReader_binary and asserts the FMT record parses out. This is the test that would have caught the FireVPS corruption; without it we had no end-to-end "the file is actually a valid bin log" check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
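The strict gate reduces to a tiny state machine; a Python sketch of the rule (the real code is C++, names assumed):

```python
class BinlogGate:
    """Discard every DATA_BLOCK until seqno == 0 arrives: no file open,
    no ACK. A mid-stream vehicle must restart its log stream first."""
    def __init__(self):
        self.started = False

    def accept(self, seqno):
        if not self.started:
            if seqno != 0:
                return False     # silently discard: no open, no ACK
            self.started = True  # fresh stream: open the file now
        return True
```

Once the gate opens it never re-closes for the life of the child, so later out-of-order seqnos still flow through the normal sparse-write path.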
ArduPilot pre-arm rejected the vehicle on FireVPS with "logging failed" whenever LOG_BACKEND_TYPE included the mavlink backend (=3).
Root cause: AP_Logger_MAVLink::logging_failed() returns !_sending_to_client, and that flag only flips to true when the vehicle receives a REMOTE_LOG_BLOCK_STATUS with status=ACK and seqno=MAV_REMOTE_LOG_DATA_BLOCK_START (2147483646); see AP_Logger_MAVLink.cpp:240-251. We weren't sending it, so the vehicle thought no GCS-side logger was listening and refused to arm. (As a side-effect, the same START also resets the vehicle's seqno counter to 0, which our strict file-open gate already requires.)
BinlogWriter::tick() gains a "before any DATA_BLOCK" phase: while binlog is enabled and any_block_seen is false, emit REMOTE_LOG_BLOCK_STATUS(seqno=START_MAGIC, status=ACK) at 1 Hz, throttled by last_start_sent_s. After the first DATA_BLOCK arrives we fall through to the existing ACK/NACK pump; the continuous ACK traffic from real blocks resets ArduPilot's 10-second _last_response_time client-timeout for us.
target_system / target_component on the outgoing START are 0 (broadcast): we haven't latched a vehicle sysid yet, and ArduPilot's handle_ack() picks the proxy's identity off the message header's src fields, not the body's target fields. The tick site in supportproxy.cpp now fires on `binlog_enabled && have_conn1` instead of `binlog.is_open()`, since the START phase predates the file open.
Two tests in tests/test_binlog_capture.py:
* test_proxy_sends_remote_log_start_when_idle drives HEARTBEATs at ~1 Hz (the way a real ArduPilot does; main_loop's select wakes on each one, giving tick a chance to fire) and asserts ≥ 2 STARTs arrive on the user-side socket in 3 s.
* test_proxy_stops_sending_start_after_first_data_block sends seqno=0,1 right away, drains the ACKs, then watches for 2 s and asserts no further STARTs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
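The new tick() phase is easy to sketch; a Python model of the 1 Hz throttle and the hand-off to the normal pump (the START magic is from the MAVLink enum, everything else is assumed naming):

```python
MAV_REMOTE_LOG_DATA_BLOCK_START = 2147483646  # START magic seqno

class StartBeacon:
    """Emit one START ack per second until the first DATA_BLOCK arrives,
    so AP_Logger_MAVLink flips _sending_to_client and pre-arm passes."""
    def __init__(self):
        self.any_block_seen = False
        self.last_start_sent_s = float("-inf")  # fire on the first tick

    def tick(self, now_s, send_status):
        if self.any_block_seen:
            return  # real-block ACK traffic keeps the vehicle's timer fed
        if now_s - self.last_start_sent_s >= 1.0:
            # status=ACK with the START magic seqno, broadcast target
            send_status(seqno=MAV_REMOTE_LOG_DATA_BLOCK_START, status=1)
            self.last_start_sent_s = now_s
```

Since main_loop only wakes on incoming packets, the effective rate is "at most 1 Hz, gated on traffic", which the HEARTBEAT-driven test reflects.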
MAX_ACKS_PER_TICK = 10 was carried over from MAVProxy's 100 Hz idle-loop pump, where 10 per loop already delivers 1 kHz of throughput. Our main_loop wakes on each incoming packet rather than on a fixed schedule, and under a TCP burst one recv() can return ~45 MAVLink frames in a single call; the cap would let the ACK queue age 4+ ticks before catching up, prolonging the vehicle's "pending-block" pressure exactly when the vehicle is logging hardest.
Drop the cap; drain the whole pending_acks queue per tick. Each ACK is a small UDP send; the cost difference is irrelevant next to the benefit of freeing the vehicle's queue faster.
Field motivation: FireVPS DMS messages showed ~1 % message drop during ArduPilot pre-arm bursts (~0.25 % in level flight). The dominant cause is vehicle-side semaphore contention in _WritePrioritisedBlock; we can't eliminate that from the proxy, but the uncapped ACK pump is the one knob on our side that strictly helps. NACKs keep their MAX_NACKS_PER_TICK throttle so a wide gap can't flood the wire with status messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
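The asymmetry after this change (uncapped ACKs, throttled NACKs) in one sketch; the throttle value and names are assumptions:

```python
MAX_NACKS_PER_TICK = 10  # assumed value; only the ACK cap was dropped

def pump(pending_acks, pending_nacks, send):
    """One tick of the status pump: drain EVERY pending ACK, but keep
    the NACK throttle so a wide seqno gap can't flood the wire."""
    while pending_acks:
        send("ACK", pending_acks.pop(0))
    for _ in range(MAX_NACKS_PER_TICK):
        if not pending_nacks:
            break
        send("NACK", pending_nacks.pop(0))
```

After a bursty recv() the ACK backlog now clears in the same tick it was created, while a 1000-block gap still leaks NACKs out at a bounded rate.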
The retention field is shared by .tlog and .bin recordings (the cleanup loop already ages both), so the tlog-specific name was misleading. Rename the C struct field, the Python helper (set_tlog_retention -> set_log_retention), the default constant, and the __str__ output.
The on-disk struct layout is unchanged: same 4-byte float at the same offset, so existing keys.tdb entries decode without migration. The FLAG_TLOG bit, the 'tlog' flag name, and tlog-specific tests keep their names; only the shared retention field is renamed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hourly cleanup worker now ages out both .tlog and .bin recordings under the shared retention field, so the tlog-specific naming on the loop entry point and the printf strings was misleading. Rename tlog_cleanup_loop / tlog_cleanup_once and the surrounding log lines; the file predicate and per-entry behaviour are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The listing and download blueprint now serves both .tlog and .bin session files, and the retention form field is shared, so the tlog-specific URLs, blueprint names, template names, and form labels were misleading. Rename:
* webadmin/tlogs.py -> webadmin/logs.py
* /admin/tlogs/<port2>/... -> /admin/logs/<port2>/...
* /me/tlogs/... -> /me/logs/...
* admin_tlogs.html -> admin_logs.html
* owner_tlogs.html -> owner_logs.html
* form.tlog_retention_days -> form.log_retention_days
* blueprint admin_tlogs -> admin_logs
* blueprint owner_tlogs -> owner_logs
The tlog_enabled / binlog_enabled checkboxes keep their names since they map 1:1 to the per-flag bits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
still working on the details, but basic support works
