Native SMS inbox backend (read/unread, ISO 8601, multi-part, UCS-2) #22

Open
dr-dolomite wants to merge 28 commits into development-home from feature/sms-native-backend

@dr-dolomite
Owner

Summary

Replaces the SMS inbox read path with a native shell + awk pipeline (no sms_tool binary for reads), keeping the backend localized. Adds the inbox features requested: read/unread status, newest-first sort, and correct multi-part + non-Latin/emoji decoding.

  • PDU codec (scripts/usr/lib/qmanager/sms_pdu.awk): GSM-7 (incl. extension table), UCS-2 BE → UTF-8 with surrogate pairs (U+FFFD-hardened), UDH concat parsing (reference/part/total), ISO 8601 SCTS with TZ offset, read/unread status from the CMGL stat byte, and a batched decode_list op (one awk invocation per request, not per message).
  • CGI (scripts/www/cgi-bin/quecmanager/cellular/sms.sh): inbox GET now runs qcmd AT+CMGF=0 + AT+CMGL=4 piped through the codec — locking inherited from qcmd's flock (no self-lock); CR-hardened + +CME/+CMS ERROR guarded parser. Merge carries status (unread-if-any-part) and default-sorts newest-first by timestamp. New mark_read POST action (AT+CMGR=<idx>,0). Fixed the long-standing storage 0/0 bug (now AT+CPMS?).
  • Contract (types/sms.ts): status: "read" | "unread" required; timestamp documented as ISO 8601.

Scope / out of scope

  • send / delete / delete_all still use sms_tool (intentionally — separate future plan).
  • Frontend filter/sort/mark-read UI wiring is a separate follow-up.
  • Tracked follow-up: harden delete + mark_read with numeric-index validation + consistent temp-file naming.

Test Plan

  • Codec unit suite: bash tests/sms_pdu_test.sh → passed=13 failed=0 (13 fixtures: GSM-7, UCS-2 BMP + emoji surrogate, UDH concat, lone-surrogate, ISO 8601 TZ, batched decode_list, no-stale-concat invariant)
  • sh -n clean on sms.sh; bunx tsc --noEmit adds zero new errors
  • Live-device parity (RM551E, BusyBox awk): content byte-clean vs the pre-migration sms_tool baseline across all 51 merged messages; BusyBox sprintf("%c") UTF-8 emission confirmed; storage fixed to 73/255; newest-first confirmed; mark_read plumbing verified end-to-end; lock coordination clean. Full record in tests/fixtures/sms/parity/README.md.
  • Re-verify a live unread→read flip when an unread message is present (device had 0 unread at verification time)
  • Frontend wiring (separate plan)

🤖 Generated with Claude Code

Frozen snapshots of the current sms_tool-based pipeline taken from the live
test modem (RM551EGL, 2026-05-28) for use as diff targets in Task 17 (Phase 1)
and Task 32 (final) parity verification during the native QMI SMS migration.

- Add trailing newline to baseline.status.txt for cleaner Task 17 diffs
- Expand README with concrete jq diff commands for storage field
- Note sibling directory layout for decode/encode codec fixtures

Implements Task 2 of the SMS native backend plan.

Codec changes (scripts/usr/lib/qmanager/sms_pdu.awk):
- hex2dec(), swap_pair(), decode_digits() — hex and semi-octet helpers
- decode_scts() — 7-octet timestamp → "MM/DD/YY HH:MM:SS"
- decode_gsm7_address() — stub returning [ALPHA_TODO]; full GSM-7 in Task 3
- pop() — consumes N hex chars from front of pdu state string
- do_decode_one() — main driver: skips SMSC, reads TPDU first octet (UDHI/MTI),
  decodes OA length + TON + digit/alpha address, PID, DCS, SCTS, UDL,
  stubs content as raw UD hex remainder
- json_str() — JSON string escaper
- Driver block wires op=decode_one to read hex lines and call do_decode_one()

Fixture 02 (tests/fixtures/sms/decode/02_header_only.*):
- PDU captured from live device (SM storage index 10, AT+CMGR=10)
- Sender: 639686973969 (TON 0x91 international E.164 — different OA path from
  fixture 01's TON 0xD0 alphanumeric sender)
- Message is multi-part (UDHI=1); Task 5 will parse UDH; for now content = raw UD hex
- Expected JSON uses "index":0 (no -v idx= passed; real propagation in Task 6)
- Fixture 02 PASSES fully with this implementation

Test runner hardening (tests/sms_pdu_test.sh) — carry-forward from Task 1 review:
- Changed shebang from #!/bin/sh to #!/usr/bin/env bash (file uses a bash extension,
  process substitution <(...), alongside set -eu)
- Added jq and awk preflight guards (fail fast with clear message when missing)

Known partial-pass state:
- Fixture 01 FAILS on three fields: sender ([ALPHA_TODO] vs SmartApp, Task 3),
  content (raw hex vs decoded GSM-7 text, Task 3), index (0 vs 30, Task 6)
- Timestamp for fixture 01 matches correctly
- All three failing fields resolve in their respective tasks (Task 3 and Task 6)

Device BusyBox awk bitwise capability check (2026-05-28):
  echo "" | awk 'BEGIN{print and(0xF0,0x0F),or(1,2),lshift(1,4),rshift(16,2)}'
  Output: 0 3 16 4  — built-in ops confirmed present.
Using arithmetic-only wrappers (and_int/or_int/lshift_int/rshift_int) regardless,
for portability across BusyBox awk builds and GAWK on the test host.
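
For illustration, arithmetic-only wrappers of this kind could look like the sketch below. The four function names come from this commit; the bodies are an assumption (not the code in sms_pdu.awk) and only handle non-negative integers. The shell wrapper bitops_demo is a hypothetical name used here so the snippet can be run on its own.

```shell
bitops_demo() {
  awk '
  # Sketch only: arithmetic stand-ins for and()/or()/lshift()/rshift().
  # Assumed bodies, valid for non-negative integers.
  function lshift_int(x, n) { while (n-- > 0) x *= 2; return x }
  function rshift_int(x, n) { while (n-- > 0) x = int(x / 2); return x }
  function and_int(a, b,   r, bit) {
    r = 0; bit = 1
    while (a > 0 && b > 0) {
      if (a % 2 == 1 && b % 2 == 1) r += bit   # both low bits set
      a = int(a / 2); b = int(b / 2); bit *= 2
    }
    return r
  }
  function or_int(a, b,   r, bit) {
    r = 0; bit = 1
    while (a > 0 || b > 0) {
      if (a % 2 == 1 || b % 2 == 1) r += bit   # either low bit set
      a = int(a / 2); b = int(b / 2); bit *= 2
    }
    return r
  }
  # Same four operations as the device check, with decimal literals
  # (0xF0 = 240, 0x0F = 15) because hex literals are not portable in awk.
  BEGIN { print and_int(240, 15), or_int(1, 2), lshift_int(1, 4), rshift_int(16, 2) }
  '
}

bitops_demo   # prints: 0 3 16 4
```

The output matches the built-in check recorded above, which is the property the wrappers need to preserve.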

Changes:
- sms_pdu.awk: add gsm7_init() (direct array assignments — avoids the split()
  comma-separator collision at septet 44), gsm7_unpack() with UDH skip-alignment,
  arithmetic bitwise helpers, and decode_gsm7_address() for TON 0xD0 senders.
  Wire DCS dispatch in do_decode_one(): alphabet=0→GSM-7, 2→raw hex (Task 4),
  else raw hex. All new locals properly scoped; bytes[] is a local awk array.
- fixture 03: synthetic single-part GSM-7 PDU (no live single-part messages
  remain in inbox; all available slots are multipart concat fragments). Sender
  +1234567890, text "Hello from QManager! This is a GSM-7 test message."
- fixture 04: synthetic GSM-7 PDU with extension-table characters [ ] ^.
  Sender +9876543210, text "GSM7 ext: [brackets] and ^caret^."
- fixture 01 expected: index updated 30→0 (idx propagation arrives in Task 6), sender decoded to "SmartApp",
  content matches real GSM-7 decode. PASS.
- fixture 02 expected: DCS=0x00 so decoder now unpacks GSM-7 body with UDH
  skip-alignment (udhl=5 bytes → skip 7 septets). Content is Lorem Ipsum
  test message. Task 5 will update once UDH concat fields are parsed. PASS.

All 4 fixtures pass: passed=4 failed=0.
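
The skip-alignment arithmetic (udhl=5 bytes → skip 7 septets) can be sketched as follows. This is a minimal, assumed reimplementation rather than the gsm7_unpack() in sms_pdu.awk: gsm7_demo is a hypothetical wrapper, and it prints septets with sprintf("%c") directly, which is only correct for the letters and digits where GSM-7 and ASCII coincide (no gsm7_init() table, no extension table).

```shell
gsm7_demo() {
  # $1 = user-data hex, $2 = UDL in septets, $3 = UDHL in bytes (-1 = no UDH)
  awk -v hex="$1" -v udl="$2" -v udhl="${3:--1}" '
  function hex2dec(h,   i, n) {
    h = toupper(h); n = 0
    for (i = 1; i <= length(h); i++)
      n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
    return n
  }
  BEGIN {
    nbytes = int(length(hex) / 2)
    for (i = 0; i < nbytes; i++) b[i] = hex2dec(substr(hex, 2 * i + 1, 2))
    b[nbytes] = 0                       # padding so b[idx + 1] is always defined
    # The UDH occupies (udhl + 1) octets; round up to whole septets.
    skip = (udhl + 0 >= 0) ? int(((udhl + 1) * 8 + 6) / 7) : 0
    out = ""
    for (s = skip; s < udl; s++) {
      bit = s * 7; idx = int(bit / 8); sh = bit % 8
      # Take 7 bits starting at bit offset sh, spanning up to two octets.
      v = (int(b[idx] / (2 ^ sh)) + b[idx + 1] * (2 ^ (8 - sh))) % 128
      out = out sprintf("%c", v)        # ASCII-compatible septets only
    }
    printf "skip=%d text=%s\n", skip, out
  }'
}

gsm7_demo C8329BFD06 5                  # skip=0 text=Hello
gsm7_demo 050003A402019069 9 5          # skip=7 text=Hi
```

In the second call the six UDH octets cover 48 bits, so one fill bit is inserted and the text starts at septet 7 (bit 49).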
Status comes from the +CMGL header line, passed in via -v stat=N (0=unread,
default 1=read). Emitted as JSON string "unread"|"read" so the CGI merge and
UI can consume it directly.

Replaces sms_tool's "MM/DD/YY HH:MM:SS" format with "YYYY-MM-DDTHH:MM:SS±HH:MM"
so the CGI can sort lexicographically (= chronologically) and the UI can use
native Date parsing. TZ byte decoded per 3GPP TS 23.040 §9.2.3.11 — bit 3 is
sign, remaining bits are semi-octet-swapped BCD quarter-hours.
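
As a rough sketch of that decoding, assuming a 14-hex-character SCTS field as input: swap_pair and decode_scts are names used by this PR, but the bodies below are an illustration rather than the shipped code, and scts_to_iso is a hypothetical shell wrapper. The 20xx century prefix is hardcoded here as well.

```shell
scts_to_iso() {
  # $1 = 7-octet SCTS as 14 hex characters (semi-octet swapped BCD)
  awk -v s="$1" '
  function swap_pair(p) { return substr(p, 2, 1) substr(p, 1, 1) }
  function decode_scts(s,   lo, units, tens, sign, q) {
    # TZ octet: the first hex char holds the units digit; the second holds
    # the tens digit in bits 0-2 and the sign in bit 3.
    units = substr(s, 13, 1) + 0
    lo = index("0123456789ABCDEF", toupper(substr(s, 14, 1))) - 1
    sign = (lo >= 8) ? "-" : "+"
    tens = lo % 8
    q = tens * 10 + units               # offset in quarter-hours
    return sprintf("20%s-%s-%sT%s:%s:%s%s%02d:%02d",
      swap_pair(substr(s, 1, 2)), swap_pair(substr(s, 3, 2)),
      swap_pair(substr(s, 5, 2)), swap_pair(substr(s, 7, 2)),
      swap_pair(substr(s, 9, 2)), swap_pair(substr(s, 11, 2)),
      sign, int(q / 4), (q % 4) * 15)
  }
  BEGIN { print decode_scts(s) }'
}

scts_to_iso 62508241035423   # 2026-05-28T14:30:45+08:00
scts_to_iso 6250824103548A   # 2026-05-28T14:30:45-07:00
```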

Parity README updated with the new "expected diffs" entry for timestamp format
and the new status field added in Task 1.

Two review-feedback fixes on the ISO 8601 SCTS commit:
- Parity README bullets referenced `.msg[]` but baseline.cgi.json uses
  `.messages` at the top level. Update labels (jq commands were already
  correct).
- Document the hardcoded 20xx century prefix in decode_scts.

Parses IEI 0x00 (8-bit ref) and 0x08 (16-bit ref) information elements from
the UDH. Emits reference/part/total alongside content so the CGI's existing
jq group_by(.sender + "|" + .reference) merger keeps assembling multi-part
messages correctly under the new native pipeline.
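
A minimal sketch of that information-element walk, assuming the input is the UDH hex starting at the UDHL octet. parse_udh and its output format are hypothetical; the real codec emits reference/part/total as JSON fields instead.

```shell
parse_udh() {
  # $1 = UDH as hex, beginning with the UDHL octet
  awk -v udh="$1" '
  function hex2dec(h,   i, n) {
    h = toupper(h); n = 0
    for (i = 1; i <= length(h); i++)
      n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
    return n
  }
  BEGIN {
    udhl = hex2dec(substr(udh, 1, 2))
    last = 2 + udhl * 2                        # last hex char inside the UDH
    if (last > length(udh)) last = length(udh) # clamp a malformed UDHL
    pos = 3; found = 0
    while (pos + 3 <= last) {                  # need a full IEI + IEDL header
      iei  = hex2dec(substr(udh, pos, 2))
      iedl = hex2dec(substr(udh, pos + 2, 2))
      d = substr(udh, pos + 4, iedl * 2)
      if (iei == 0 && iedl == 3) {             # concat, 8-bit reference
        ref = hex2dec(substr(d, 1, 2)); total = hex2dec(substr(d, 3, 2))
        part = hex2dec(substr(d, 5, 2)); found = 1
      } else if (iei == 8 && iedl == 4) {      # concat, 16-bit reference
        ref = hex2dec(substr(d, 1, 4)); total = hex2dec(substr(d, 5, 2))
        part = hex2dec(substr(d, 7, 2)); found = 1
      }
      pos += 4 + iedl * 2                      # skip to the next IE
    }
    if (found) printf "reference=%d part=%d total=%d\n", ref, part, total
    else print "no concat IE"
  }'
}

parse_udh 050003A40201      # reference=164 part=1 total=2
parse_udh 06080412340301    # reference=4660 part=1 total=3
```

Note the octet order inside both elements is reference, total, then part number.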
Review feedback on the UDH concat commit:
- The doc comment claimed do_decode_one clears the globals; actually the
  function self-clears at entry. Corrected so the Task 5 decode_list author
  isn't misled into pre-clearing at the call site.
- Loop guard pos+3<=length makes the 'need a full IEI+IEDL header' invariant
  explicit and avoids a spurious partial-byte read on a malformed UDHL.

DCS alphabet 0b10 (UCS-2) is the second most common encoding after GSM-7,
used whenever a message contains any non-GSM-7 character (CJK, emoji, etc.).
UDH (when present) is byte-aligned in UCS-2 so we just skip (UDHL+1) bytes
before reading 16-bit code units. Surrogate pairs reconstruct supplementary
plane code points so emoji round-trip correctly.

Lone or mispaired surrogates (0xD800-0xDFFF) were emitted as 3-byte WTF-8,
which is invalid UTF-8. In decode_list mode (Task 5) the whole inbox is one
JSON array piped through jq at once, so a single corrupt message could make
jq reject every message. Now replaced with U+FFFD for deterministic, valid
output regardless of jq leniency. Also documents the sprintf(%c) byte-emission
portability assumption (re-verified on device in Task 11).
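
The surrogate handling can be sketched independently of byte emission. The snippet below is an assumed illustration, not the codec itself: ucs2_codepoints is a hypothetical helper that prints code points as U+XXXX and deliberately leaves out the UTF-8 step, since that depends on the sprintf("%c") behaviour noted above.

```shell
ucs2_codepoints() {
  # $1 = UCS-2 big-endian user data as hex (UDH already skipped)
  awk -v hex="$1" '
  function hex2dec(h,   i, n) {
    h = toupper(h); n = 0
    for (i = 1; i <= length(h); i++)
      n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
    return n
  }
  BEGIN {
    n = int(length(hex) / 4); out = ""
    for (i = 0; i < n; i++) {
      u = hex2dec(substr(hex, 4 * i + 1, 4))
      if (u >= 55296 && u <= 56319) {            # high surrogate D800..DBFF
        lo = (i + 1 < n) ? hex2dec(substr(hex, 4 * (i + 1) + 1, 4)) : 0
        if (lo >= 56320 && lo <= 57343) {        # valid low surrogate follows
          u = 65536 + (u - 55296) * 1024 + (lo - 56320)
          i++                                    # consume the pair
        } else {
          u = 65533                              # lone high surrogate
        }
      } else if (u >= 56320 && u <= 57343) {
        u = 65533                                # lone low surrogate
      }
      out = out (out == "" ? "" : " ") sprintf("U+%04X", u)
    }
    print out
  }'
}

ucs2_codepoints 00480069D83DDE00   # U+0048 U+0069 U+1F600
ucs2_codepoints D83D0041           # U+FFFD U+0041
```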
Reads idx|stat|hex lines from stdin and emits {"msg":[...]} — same envelope
sms_tool recv -j produces. Lets sms.sh call awk once per CGI request instead
of once per message, avoiding ~50 awk cold-starts on a fully-stocked inbox.

do_decode_one now builds its JSON into a global record_buf and only prints to
stdout when not in buffered mode, so decode_one behavior is unchanged while
decode_list collects records into the array.
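
The batching shape is roughly as follows. This is a sketch of the envelope only: decode_list_sketch is a hypothetical name, and the pdu_octets field is a placeholder standing in for the real decoded fields (sender, content, timestamp, and so on), which are not reproduced here.

```shell
decode_list_sketch() {
  # stdin: idx|stat|hex, one stored message per line
  awk -F'[|]' '
  NF >= 3 {
    stat = ($2 == "") ? 1 : $2 + 0               # default to read when absent
    rec = sprintf("{\"index\":%d,\"status\":\"%s\",\"pdu_octets\":%d}",
                  $1, (stat == 0 ? "unread" : "read"), length($3) / 2)
    buf = (n++ ? buf "," : "") rec               # buffer instead of printing
  }
  END { printf "{\"msg\":[%s]}\n", buf }         # one envelope per request
  '
}

printf '3|0|0791AB\n7|1|07\n' | decode_list_sketch
```

An empty inbox yields {"msg":[]}, so the consumer still receives a well-formed envelope.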
Adds a UDH-multipart message followed by a plain message in one decode_list
batch. Asserts the plain message emits no reference/part/total — guards against
a future refactor moving the udh_found reset inside the UDHI branch, which
would silently leak concat fields between records.

Inbox GET now calls qcmd "AT+CMGF=0" + qcmd "AT+CMGL=4" and pipes each
(idx, stat, pdu) line into sms_pdu.awk op=decode_list. Coordination with
the rest of the modem traffic is inherited from qcmd's /var/lock/qmanager.lock
flock — no extra locking required (would risk self-deadlock).

The send / delete / delete_all POST paths still use sms_tool for now; they
share the same lock file via _sms_run and work fine. Replacing them is a
follow-up plan.

Review feedback on the native CMGL commit:
- Declare raw/pipe_in as local (codebase convention; avoids global pollution
  if the call site is ever refactored off command substitution).
- Strip trailing CR in the CMGL awk parser so \r\n line endings from the
  modem don't cause OK\r to be mis-consumed as a PDU (which would silently
  drop the last message). No-op when qcmd already strips CR.

Multi-part merge now carries the status field: result is unread if any part
is unread. Timestamp on a merged message is the min (earliest part) so a
concat SMS spanning a clock tick still sorts deterministically. Default
sort is timestamp desc — replaces the old sort_by(-.indexes[0]) which was
storage-slot order, not chronological (Quectel reuses freed slots).
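
The CGI does this merge in jq; purely to illustrate the reduction rules (unread if any part is unread, earliest timestamp wins, newest first), here is an assumed awk + sort equivalent. merge_parts, the pipe-separated input layout, and the sample senders are all hypothetical, and content reassembly is omitted.

```shell
merge_parts() {
  # stdin: index|sender|reference|status|timestamp, one line per stored part
  awk -F'[|]' '
  {
    # Parts sharing sender + reference form one message; a part with no
    # reference stays on its own, keyed by storage index.
    key = ($3 == "") ? ("single|" $1) : ($2 "|" $3)
    sender[key] = $2
    if (!(key in ts) || ($5 "") < (ts[key] "")) ts[key] = $5   # min timestamp
    if ($4 == "unread") unread[key] = 1                        # unread-if-any
  }
  END {
    for (k in ts)
      print ts[k] "|" sender[k] "|" ((k in unread) ? "unread" : "read")
  }' | LC_ALL=C sort -r                                        # newest first
}

printf '%s\n' \
  '1|+15550100|164|read|2026-05-28T14:30:45+08:00' \
  '2|+15550100|164|unread|2026-05-28T14:30:46+08:00' \
  '3|SmartApp||read|2026-05-27T09:00:00+08:00' | merge_parts
```

The two parts with reference 164 collapse into one unread entry stamped 14:30:45, followed by the older single-part message.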
The lexicographic timestamp sort equals chronological order only when all
messages share one UTC offset — true for a single modem whose SMSC stamps
a consistent TZ. Documents the invariant so cross-timezone correctness isn't
assumed by a future maintainer.
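
To make the invariant concrete, the sketch below sorts by the actual instant instead of by string. It is not part of this PR: sort_by_instant is a hypothetical helper, and the date arithmetic is a standard days-from-civil calculation written out because POSIX awk has no mktime().

```shell
sort_by_instant() {
  # stdin: one YYYY-MM-DDTHH:MM:SS+HH:MM timestamp per line; newest first out
  awk '
  function days_from_civil(y, m, d,   era, yoe, doy, doe) {
    if (m <= 2) y -= 1                         # count years from 1 March
    era = int(y / 400); yoe = y - era * 400
    doy = int((153 * (m + (m > 2 ? -3 : 9)) + 2) / 5) + d - 1
    doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
    return era * 146097 + doe - 719468         # days since 1970-01-01
  }
  {
    off = (substr($0, 21, 2) * 60 + substr($0, 24, 2)) * 60
    if (substr($0, 20, 1) == "-") off = -off
    t = days_from_civil(substr($0, 1, 4) + 0, substr($0, 6, 2) + 0,
                        substr($0, 9, 2) + 0) * 86400
    t += substr($0, 12, 2) * 3600 + substr($0, 15, 2) * 60 + substr($0, 18, 2)
    printf "%d|%s\n", t - off, $0              # UTC seconds, then original
  }' | LC_ALL=C sort -t"|" -k1,1nr | cut -d"|" -f2
}

# 01:00 at +08:00 is 17:00 UTC on the 27th; 20:00 at -05:00 is 01:00 UTC on the 28th.
printf '%s\n' 2026-05-28T01:00:00+08:00 2026-05-27T20:00:00-05:00 | LC_ALL=C sort -r
printf '%s\n' 2026-05-28T01:00:00+08:00 2026-05-27T20:00:00-05:00 | sort_by_instant
```

The plain sort -r puts the +08:00 entry first, while the instant-based sort puts the -05:00 entry first, which is the cross-offset case the comment warns about.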
POST {"action":"mark_read","indexes":[...]} flips each index from REC UNREAD
to REC READ via AT+CMGR=<idx>,0 (mode=0 reads the message *and* clears the
unread flag as a documented side effect). Body is discarded; only the status
mutation matters. Wrapped in qcmd's existing lock.

The old grep '[0-9]*/[0-9]*' pattern never matched sms_tool's
"used: N, total: M" format, so storage.used/total always reported 0
(documented in tests/fixtures/sms/parity/README.md). Native AT+CPMS?
returns N,M in a parseable form and stays consistent with the rest of
the native pipeline.
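
A minimal sketch of such a parser, assuming the standard +CPMS: <mem1>,<used1>,<total1>,... response layout and taking the first storage's counters. parse_cpms, the "ME" storage name, and the JSON shape are assumptions for illustration; the CGI may select or format the fields differently.

```shell
parse_cpms() {
  # stdin: raw AT+CPMS? response, possibly with CRLF line endings
  awk '
  { sub(/\r$/, "") }                           # tolerate CRLF from the modem
  /^\+CPMS:/ {
    n = split($0, f, ",")
    if (n >= 3) {
      printf "{\"used\":%d,\"total\":%d}\n", f[2], f[3]
      found = 1
      exit
    }
  }
  END { if (!found) print "{\"used\":0,\"total\":0}" }
  '
}

printf '+CPMS: "ME",73,255,"ME",73,255,"ME",73,255\r\nOK\r\n' | parse_cpms
# {"used":73,"total":255}
```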
Reflects the native backend contract: status is required (consumers must
handle "read"|"unread"), timestamp is ISO 8601 with TZ offset so lexicographic
sort matches chronological order. Frontend filter UI / mark-read wiring is
a follow-up plan.

Aligns the consumer-facing type doc with the CGI source comment: the
lexicographic==chronological claim holds only when all messages share one
SMSC timezone offset (true for a single modem).

Author-time PDU encoder used to derive the decode/ fixtures (GSM-7 + UCS-2,
SMS-DELIVER framing, optional concat UDH). Not shipped to the device and not
invoked by the test harness; purely for regenerating known-good fixture hex.

Live-device run on the RM551E test modem: content parity is byte-clean across
all 51 merged messages vs the pre-migration sms_tool baseline; storage bug
fixed (73/255); newest-first sort confirmed; mark_read plumbing verified end
to end; lock coordination clean. BusyBox awk confirmed to emit UTF-8 bytes via
sprintf("%c") identically to GAWK. One gap: the inbox had no unread messages,
so the actual unread->read flip could not be observed (plumbing proven).

Final-review hardening. The /^ERROR/ guard didn't catch +CME ERROR: / +CMS
ERROR: (they start with +). Harmless when an error is the sole response, but
a +CMGL: header followed by a +CME ERROR: (storage corruption mid-enumeration)
would consume the error line as PDU hex and surface a ghost message. Now
skipped, so the corrupt slot is dropped cleanly. Also: renumber duplicate GET
handler comment labels and mark the not-yet-implemented encode_* ops as PLANNED.
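
Put together with the earlier CR fix, a CMGL parser with these guards might look like the sketch below. It is an assumed reconstruction, not the parser in sms.sh: parse_cmgl is a hypothetical name, and the input is expected to follow the usual PDU-mode layout of a +CMGL: <index>,<stat>,[<alpha>],<length> header followed by one hex line.

```shell
parse_cmgl() {
  # stdin: raw AT+CMGL=4 output; stdout: idx|stat|pdu lines for the codec
  awk '
  { sub(/\r$/, "") }                             # strip trailing CR
  /^\+CMGL:/ {
    split(substr($0, 8), h, ",")                 # text after "+CMGL: "
    idx = h[1] + 0; stat = h[2] + 0
    want = 1                                     # next line should be the PDU
    next
  }
  /^\+CM[ES] ERROR/ || /^ERROR/ || /^OK$/ || $0 == "" {
    want = 0                                     # drop the pending header
    next
  }
  want && $0 ~ /^[0-9A-Fa-f]+$/ {
    print idx "|" stat "|" $0
    want = 0
  }
  '
}

printf '+CMGL: 3,0,,24\r\n0791AB\r\n+CMGL: 5,1,,30\r\n+CMS ERROR: 321\r\n+CMGL: 7,1,,30\r\n07CD\r\nOK\r\n' | parse_cmgl
```

Slot 5 is dropped because its header is followed by an error line, so no ghost message is emitted; slots 3 and 7 pass through as 3|0|0791AB and 7|1|07CD.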
Defense-in-depth on the delete and mark_read POST handlers (both loop over a
client-supplied indexes array):
- Reject empty/non-numeric indexes via a case guard, counting each as a
  failure so an all-bad request reports partial_failure rather than success.
- delete now uses a PID-qualified temp file ($$) to match mark_read, closing
  a concurrent-request race on the previously-static /tmp path.

The injection surface was already safe (idx is double-quoted; qcmd/atcli_smd11
write a single AT line), so this is hardening, not a vuln fix.
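
The case guard described above can be sketched like this. The helper names is_valid_index and count_failures are hypothetical, and the real handlers issue the AT command for each valid index rather than only counting.

```shell
# Succeeds only for a non-empty string made entirely of decimal digits.
is_valid_index() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    *) return 0 ;;
  esac
}

# Counts rejected entries so an all-bad request cannot report success.
count_failures() {
  failed=0
  for idx in "$@"; do
    is_valid_index "$idx" || failed=$((failed + 1))
  done
  echo "$failed"
}

count_failures 3 "" "7;reboot" 12   # prints 2
```

Because the check is a pure shell pattern match, it adds no extra process per index.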