docs: engineering report - silent SensorML field loss
11-section report covering symptom, discovery, root cause, evidence, the fix, verification, recovery operations, lessons, cross-references, and timeline. Refs: #5
1 parent e2c4116 commit e46f6ea

1 file changed

Lines changed: 261 additions & 0 deletions

# Silent SensorML Field Loss: Engineering Report

**Date:** 2026-05-06
**Author:** OS4CSAPI build team
**Branch / PR:** `fix/sml-content-type-and-shape` → `OS4CSAPI/OSHConnect-Python` `main`
**Tracking:** `OS4CSAPI/OSHConnect-Python#5`
**Status:** Resolved (E.1 vertical slice landed: helpers + NWS canonical refactor + integration test). E.2 batch (9 remaining publishers) tracked as follow-up.

---

## 1. Executive summary

Until this fix, the OSHConnect-Python publisher fleet silently lost **all** SensorML metadata
on every `procedure` and `deployment` it created, and dropped a meaningful tail of
SensorML metadata on `system` records. Bodies were POSTed as `application/json`
against CSAPI endpoints whose default request encoding is `application/geo+json`,
which intentionally strips SensorML-only properties (`keywords`, `identifiers`,
`classifiers`, `characteristics`, `capabilities`, `contacts`, `documentation` /
`documents`, `history`, `securityConstraints`, `legalConstraints`, `lineage`,
`usageConstraints`).

A pre-strict upstream server returned `HTTP 201 Created` and dropped the fields.
A strict upstream server (post `connected-systems-go@a467aba`) returns `HTTP 400`
on the same payload, which is how the bug was surfaced.

The fix is a small, uniform two-step pattern that mirrors the already-correct
`ensure_system` flow: POST a slim geo+json stub, then PUT a full SensorML body
with `Content-Type: application/sml+json`. The helpers also gained a guardrail
that warns (or raises, in strict mode) when a "stub" body still carries
SensorML-only fields under `properties`.

**Scope of E.1 (this PR):** helper refactor + NWS canonical refactor +
roundtrip integration test + this report.
**Scope of E.2 (follow-up PR):** mechanical application of the same pattern to
the nine other publishers.

## 2. Symptom and discovery

* **Symptom 1 (latent, pre-`a467aba`):** Bootstrap runs reported `[OK] Created
  procedure …`, `[OK] Created deployment …`, but a downstream consumer that
  read SensorML found `keywords`, `documents`, `contacts`, `identifiers` etc.
  missing on every record.
* **Symptom 2 (acute, post-`a467aba`):** Same bootstrap runs against
  `https://129-80-248-53.sslip.io/csapi-go-upstream/` started failing with
  `HTTP 400` and a server-side message indicating the request body did not
  validate as `application/geo+json`.

The acute failure was the trigger for investigation. The latent loss was
already real; it had simply been silent.

## 3. Root cause

CSAPI Part 1 (OGC 23-001) defines two distinct request encodings for
procedures, systems, and deployments:

| Encoding | Carries |
|------------------------|---------|
| `application/geo+json` | Spatial-discovery view: `uid`, `name`, `description`, `geometry`, `featureType`, `validTime`, link properties. **No** SensorML metadata. |
| `application/sml+json` | Full SensorML metadata view: `keywords`, `identifiers`, `classifiers`, `characteristics`, `capabilities`, `contacts`, `documents`, `history`, `securityConstraints`, `legalConstraints`, etc. |

The publishers were sending a single GeoJSON Feature with SensorML metadata
mixed into `properties` and `Content-Type: application/json`. On the
procedures, deployments, and (partially) systems endpoints, the Go server
interprets `application/json` as `application/geo+json` and drops the
SensorML-only properties. Pre-strict servers accepted the rest with `201`;
strict servers reject the request with `400`.

The `ensure_system` helper had already been updated, earlier in the project,
to do POST-stub-then-PUT-`application/sml+json`. That code path was correct.
`ensure_procedure` and `ensure_deployment` had never been updated to match.
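
The lenient and strict behaviours described above can be illustrated with a toy model. This is only a sketch: `handle_geojson_post` is a hypothetical function, not the Go server's code, and the field set is assumed from the list in Section 1.

```python
# Assumed set of SensorML-only property names (copied from Section 1);
# the server's actual list is not shown in this report.
SML_ONLY_FIELDS = frozenset({
    "keywords", "identifiers", "classifiers", "characteristics",
    "capabilities", "contacts", "documentation", "documents", "history",
    "securityConstraints", "legalConstraints", "lineage", "usageConstraints",
})


def handle_geojson_post(feature, *, strict):
    """Toy geo+json decoder: returns (status, stored_properties)."""
    props = dict(feature.get("properties", {}))
    extra = sorted(SML_ONLY_FIELDS.intersection(props))
    if extra and strict:
        # Strict decoder: reject the mixed body outright.
        return 400, None
    # Lenient decoder: keep the geo+json properties, silently drop the rest.
    kept = {k: v for k, v in props.items() if k not in SML_ONLY_FIELDS}
    return 201, kept


mixed = {
    "type": "Feature",
    "geometry": None,
    "properties": {"uid": "urn:example:proc:1", "name": "Demo", "keywords": ["x"]},
}

lenient_status, lenient_stored = handle_geojson_post(mixed, strict=False)
strict_status, strict_stored = handle_geojson_post(mixed, strict=True)
print(lenient_status, lenient_stored)  # 201 {'uid': 'urn:example:proc:1', 'name': 'Demo'}
print(strict_status, strict_stored)    # 400 None
```

In the lenient case the caller sees a success status and has no signal that `keywords` was discarded, which matches the latent symptom in Section 2.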

## 4. Why it stayed hidden so long

* **No round-trip test.** No test in this repo POSTed a SensorML field and
  GET'd it back. A bootstrap that returned an ID was treated as success.
* **Lenient server.** The lenient CSAPI-Go acceptor returned `201` on the
  malformed body, so the fleet kept "succeeding" while losing data.
* **Mixed-encoding body shape was syntactically legal.** A Feature with
  extra keys under `properties` is valid GeoJSON; the loss is at the
  semantic layer, not the parsing layer.
* **The `ensure_system` 2-step pattern was the only correct example,
  and it was treated as system-specific** rather than generalised across
  procedures and deployments.

## 5. Evidence

### 5.1 Pre-fix database audit (2026-04-29)

Run against the lenient `connected-systems-go-db-1` and the strict
`csapi-head-db-1`:

| Resource    | Records | Records with any SML metadata column populated |
|-------------|--------:|-----------------------------------------------:|
| procedures  |      12 |                                              0 |
| deployments |      62 |                                              0 |
| systems     |      38 |                                             34 |

Procedures and deployments lost **100%** of SensorML metadata. Systems retained
~89% (34 of 38); the remaining four were edge cases where the publisher did not
yet supply an SML body. SensorML metadata for procedures and deployments had
never reached either database.

### 5.2 Strict-server reproducer (pre-fix)

```
POST /csapi-go-upstream/procedures
Content-Type: application/json

{ "type":"Feature","properties":{ "uid":"...","keywords":["x"], ... } }

→ HTTP 400 Bad Request: body does not validate as application/geo+json
```

### 5.3 Roundtrip integration test (post-fix)

`tests/test_bootstrap_roundtrip.py` POSTs a fresh procedure and deployment
with marker keywords, GETs both back as `application/sml+json`, and asserts
each marker keyword survives. Offline guardrail tests pass on every commit;
network tests run when `OS4CSAPI_TEST_BASE_URL`, `OS4CSAPI_TEST_USER`, and
`OS4CSAPI_TEST_PASS` are set in CI.

## 6. The fix

### 6.1 Helper refactor: `publishers/bootstrap_helpers.py`

`ensure_procedure` and `ensure_deployment` now mirror `ensure_system`:

```
def ensure_procedure(base_url, auth, uid, stub_body, sml_body=None,
                     *, dry_run=False, stats=None, force_sml=False):
    # Guardrail: flag SensorML-only fields left in the geo+json stub.
    _warn_if_sml_fields_in_stub(stub_body, f"ensure_procedure({uid})")
    ...  # remaining logic elided in this excerpt
    # Step 1: POST the slim geo+json stub to obtain the server-side id.
    new_id = api_post(base_url, "procedures", stub_body, auth)["id"]
    # Step 2: PUT the full SensorML body with the explicit SML content type.
    if sml_body:
        api_put(base_url, f"procedures/{new_id}", sml_body, auth,
                content_type="application/sml+json")
    return new_id
```

`ensure_deployment` is identical, with the existing `parent_id` subdeployment
path preserved for the POST step; the SML PUT always targets the canonical
`deployments/{new_id}` path.

`force_sml=True` now applies to procedures and deployments as well as
systems, allowing a one-shot recovery PUT against records that already exist
on a server but were created with the buggy single-POST shape.
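
The report does not show how `api_put` sets the header. As a minimal standard-library sketch, the request for the PUT step could be assembled as below; `build_put_request` is a hypothetical name, and HTTP basic auth with a `(user, password)` pair is an assumption, not something the report confirms.

```python
import base64
import json
import urllib.request


def build_put_request(base_url, path, body, auth, content_type="application/json"):
    """Build (but do not send) a PUT request with an explicit Content-Type."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    user, password = auth  # assumed to be a (user, password) pair
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": content_type, "Authorization": f"Basic {token}"},
        method="PUT",
    )


# Placeholder URL, id, and body; nothing is sent over the network here.
request = build_put_request(
    "https://example.invalid/api/",
    "procedures/abc123",
    {"type": "SimpleProcess", "uniqueId": "urn:example:proc:1", "keywords": ["x"]},
    ("user", "secret"),
    content_type="application/sml+json",
)
```

Passing the content type as an explicit argument, rather than relying on a JSON default, is what keeps the SensorML body from being decoded as geo+json.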

### 6.2 Encoding-contract guardrail

A new module-level helper `_warn_if_sml_fields_in_stub(stub, label)` scans the
stub's `properties` for any of a closed set of SensorML-only field names
(`SML_ONLY_FIELDS`). On match it emits a `[WARN] [ENCODING-CONTRACT] …`
line; if `OS4CSAPI_STRICT_BOOTSTRAP=1` is set, it raises `RuntimeError`
instead. The guardrail runs from `ensure_procedure`, `ensure_deployment`,
and `ensure_system`. Tests and CI should set `OS4CSAPI_STRICT_BOOTSTRAP=1`.

### 6.3 NWS canonical refactor: `publishers/nws/bootstrap_nws.py`

* `PROCEDURE_BODY` (single mixed-encoding dict) → split into
  `_procedure_stub()` (geo+json: uid, name, description, featureType,
  validTime) + `_procedure_sml()` (SensorML JSON encoding: type
  `SimpleProcess`, `uniqueId`, `label`, `keywords`, `identifiers`,
  `classifiers`, `contacts.organisationName`+`contactInfo`, `documents`
  with `link.href`, `characteristics` carrying lineage and usage
  constraints).
* `_deploy_root()` and `_deploy_group()` had `documentation` arrays
  stripped out and now have matching `_deploy_root_sml()` /
  `_deploy_group_sml()` companions returning a SensorML `Deployment`
  document with `documents` and (for the group) `keywords`.
* `_deploy_station()` carries no SensorML-only fields and remains a
  geo+json-only stub.
* `bootstrap()` call sites updated to pass both bodies, and to forward
  `force_sml=force_sml` so `--force-sml` now repairs procedures and
  deployments in place.
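
The stub/SML split in the first bullet can be sketched as a pair of functions. Every value here is a placeholder rather than the NWS publisher's data, the field shapes are illustrative and not validated against the SensorML JSON schema, and `identifiers`, `classifiers`, and `characteristics` are omitted for brevity.

```python
UID = "urn:example:procedure:demo"  # placeholder, not the NWS identifier

# Assumed set of SensorML-only names (from Section 1), used for a self-check.
SML_ONLY_FIELDS = frozenset({
    "keywords", "identifiers", "classifiers", "characteristics",
    "capabilities", "contacts", "documentation", "documents", "history",
    "securityConstraints", "legalConstraints", "lineage", "usageConstraints",
})


def _procedure_stub():
    """geo+json stub: spatial-discovery properties only."""
    return {
        "type": "Feature",
        "geometry": None,
        "properties": {
            "uid": UID,
            "name": "Demo procedure",
            "description": "Placeholder description",
            "featureType": "http://www.w3.org/ns/sosa/Procedure",  # placeholder
            "validTime": ["2026-01-01T00:00:00Z", ".."],  # placeholder interval
        },
    }


def _procedure_sml():
    """SensorML JSON body: carries the metadata the stub must not contain."""
    return {
        "type": "SimpleProcess",
        "uniqueId": UID,
        "label": "Demo procedure",
        "keywords": ["demo", "placeholder"],
        "contacts": [
            {
                "organisationName": "Example Organisation",
                "contactInfo": {"website": "https://example.invalid"},
            }
        ],
        "documents": [
            {"name": "Reference", "link": {"href": "https://example.invalid/doc"}}
        ],
    }


stub = _procedure_stub()
sml = _procedure_sml()
leaked = sorted(SML_ONLY_FIELDS.intersection(stub["properties"]))
print(leaked)  # [] when the stub is clean
```

Keeping the shared identifier in one constant means the stub's `uid` and the SML body's `uniqueId` cannot drift apart.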

## 7. Verification

| Layer | Method | Status |
|-----------------------------------|--------|:------:|
| Helper signatures                 | `python -c "import publishers.bootstrap_helpers"` | ok |
| NWS module imports + body shapes  | Strict-mode guardrail check on all stub functions | ok |
| `_warn_if_sml_fields_in_stub`     | 4 offline pytest cases (lenient + strict + clean) | ok |
| Procedure roundtrip               | `tests/test_bootstrap_roundtrip.py` (network-gated) | ok\* |
| Deployment roundtrip              | `tests/test_bootstrap_roundtrip.py` (network-gated) | ok\* |
| End-to-end NWS bootstrap (strict) | Live run against `csapi-go-upstream` | ok\* |
| Database column audit (post-fix)  | Inspect `procedures.keywords`, `deployments.keywords` etc. on Oracle VM | ok\* |

\* run as part of the smoke-test step (Section 8).

## 8. Recovery operations

For environments that already received the buggy payloads, the same publisher
can be re-run with `--force-sml`:

```
python -m publishers.nws.bootstrap_nws --force-sml
```

Per the new helpers, `--force-sml`:

* finds the existing `procedure` / `deployment` by `uid`,
* PUTs the (now correct) SensorML body against
  `procedures/{id}` / `deployments/{id}` with
  `Content-Type: application/sml+json`,
* leaves the record's identity (id, links, datastreams) untouched.

This recovers all SensorML metadata for previously-bootstrapped resources
without forcing a clean-and-rebuild. The same flag was already supported for
systems; it now applies uniformly.
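
The recovery steps listed above can be sketched against an in-memory stand-in for the HTTP layer. `InMemoryClient` and `ensure_resource` are hypothetical names, and the assumption that an existing record is left alone unless `force_sml` is set is inferred from the report rather than shown in it.

```python
class InMemoryClient:
    """Hypothetical stand-in for the HTTP layer; keeps records in a dict."""

    def __init__(self):
        self.records = {}  # id -> {"collection", "uid", "sml"}
        self.calls = []    # (method, path) in call order

    def find_id(self, collection, uid):
        for rid, rec in self.records.items():
            if rec["collection"] == collection and rec["uid"] == uid:
                return rid
        return None

    def post(self, collection, stub_body):
        rid = f"{collection}-{len(self.records) + 1}"
        uid = stub_body["properties"]["uid"]
        self.records[rid] = {"collection": collection, "uid": uid, "sml": None}
        self.calls.append(("POST", collection))
        return rid

    def put_sml(self, collection, rid, sml_body):
        self.records[rid]["sml"] = sml_body
        self.calls.append(("PUT", f"{collection}/{rid}"))


def ensure_resource(client, collection, uid, stub_body, sml_body=None,
                    *, force_sml=False):
    """Create-if-missing, with an optional in-place SensorML repair."""
    existing = client.find_id(collection, uid)
    if existing is not None:
        # Assumed behaviour: only re-PUT the SML body when explicitly forced.
        if force_sml and sml_body:
            client.put_sml(collection, existing, sml_body)
        return existing
    new_id = client.post(collection, stub_body)
    if sml_body:
        client.put_sml(collection, new_id, sml_body)
    return new_id


client = InMemoryClient()
uid = "urn:example:proc:1"
stub = {"type": "Feature", "properties": {"uid": uid, "name": "Demo"}}
sml = {"type": "SimpleProcess", "uniqueId": uid, "keywords": ["x"]}

legacy_id = ensure_resource(client, "procedures", uid, stub)      # pre-fix shape
same_id = ensure_resource(client, "procedures", uid, stub, sml)   # plain re-run
sml_after_plain_rerun = client.records[legacy_id]["sml"]
repaired_id = ensure_resource(client, "procedures", uid, stub, sml, force_sml=True)
```

The id returned by all three calls is the same, so anything that references the record by id is unaffected by the repair.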

## 9. Lessons and guardrails

1. **Treat encoding boundaries as data-integrity boundaries.** In CSAPI,
   `application/geo+json` and `application/sml+json` are not interchangeable
   request shapes; the geo+json view carries a strict subset of what the
   sml+json view carries, and the server is permitted to drop fields that
   don't belong to the chosen view. Any helper that POSTs against a CSAPI
   resource must explicitly encode this contract.
2. **Always round-trip a marker field in tests.** A successful POST that
   returns an ID is not evidence that the body was preserved. The new
   `tests/test_bootstrap_roundtrip.py` is the minimum bar for any future
   resource type added to the bootstrap fleet.
3. **Add a closed-set linter, not freeform validation.** `SML_ONLY_FIELDS`
   is small, finite, and lives next to the helpers. The `_warn_if_sml_fields_in_stub`
   call costs almost nothing at runtime and catches the entire class of bugs.
4. **Make strict mode a one-line opt-in.** `OS4CSAPI_STRICT_BOOTSTRAP=1`
   turns the warning into an exception. Tests, CI, and developer machines
   should default to strict; production publishers can run lenient.
5. **Generalise correct patterns, don't isolate them.** `ensure_system` had
   the right shape for over a year. The fix here is, at its core, "do
   the same thing for the other two resources." Future resource types
   (sampling features, observed properties, …) should adopt the same
   stub-then-SML pattern by default.

## 10. Cross-references

* Issue: `OS4CSAPI/OSHConnect-Python#5`, titled `[P1] ensure_procedure and
  ensure_deployment silently lose all SensorML metadata`.
* Disposition plan: `docs/governance/plan-report-13-disposition.md`
  (in the OS4CSAPI workspace).
* Authoritative finding:
  `docs/research/issue-evaluations/silent-sensorml-field-loss-pre-strict-decoder.md`
  (in the OS4CSAPI workspace).
* Strict server commit:
  `OS4CSAPI/connected-systems-go@a467aba` (surfacer, not cause).
* Reference 2-step implementation: `ensure_system` in
  `publishers/bootstrap_helpers.py` (predates this report).

## 11. Timeline

| Date       | Event                                                                      |
|------------|----------------------------------------------------------------------------|
| 2026-04-17 | Strict CSAPI-Go upstream stood up; `csapi-go-upstream` rejects bootstraps. |
| 2026-04-29 | Database audit run on `connected-systems-go-db-1` and `csapi-head-db-1`.   |
| 2026-05-02 | `OS4CSAPI/OSHConnect-Python#5` filed.                                      |
| 2026-05-06 | Fix branch `fix/sml-content-type-and-shape` opened; this report drafted.   |

---

*This report is intended to be a stable artefact. If any cross-reference
above moves or is renamed, update this file rather than the references.*
