# Silent SensorML Field Loss: Engineering Report

**Date:** 2026-05-06
**Author:** OS4CSAPI build team
**Branch / PR:** `fix/sml-content-type-and-shape` → `OS4CSAPI/OSHConnect-Python` `main`
**Tracking:** `OS4CSAPI/OSHConnect-Python#5`
**Status:** Resolved (E.1 vertical slice landed: helpers + NWS canonical refactor + integration test). E.2 batch (9 remaining publishers) tracked as follow-up.

---

## 1. Executive summary

Until this fix, the OSHConnect-Python publisher fleet silently lost **all** SensorML metadata
on every `procedure` and `deployment` it created, and dropped a share of SensorML metadata
on `system` records (in the 2026-04-29 audit, 4 of 38 systems carried none). Bodies were
POSTed as `application/json` against CSAPI endpoints whose default request encoding is
`application/geo+json`, and that encoding intentionally strips SensorML-only properties
(`keywords`, `identifiers`, `classifiers`, `characteristics`, `capabilities`, `contacts`,
`documentation` / `documents`, `history`, `securityConstraints`, `legalConstraints`,
`lineage`, `usageConstraints`).

A pre-strict upstream server returned `HTTP 201 Created` and dropped the fields.
A strict upstream server (post `connected-systems-go@a467aba`) returns `HTTP 400`
on the same payload, which is how the bug surfaced.

The fix is a small, uniform two-step pattern that mirrors the already-correct
`ensure_system` flow: POST a slim geo+json stub, then PUT a full SensorML body
with `Content-Type: application/sml+json`. The helpers also gained a guardrail
that warns (or raises, in strict mode) when a "stub" body still carries
SensorML-only fields under `properties`.

**Scope of E.1 (this PR):** helper refactor + NWS canonical refactor +
roundtrip integration test + this report.
**Scope of E.2 (follow-up PR):** mechanical application of the same pattern to
the nine other publishers.

## 2. Symptom and discovery

* **Symptom 1 (latent, pre-`a467aba`):** Bootstrap runs reported `[OK] Created
  procedure …`, `[OK] Created deployment …`, but a downstream consumer that
  read SensorML found `keywords`, `documents`, `contacts`, `identifiers`, and other
  SensorML fields missing on every record.
* **Symptom 2 (acute, post-`a467aba`):** The same bootstrap runs against
  `https://129-80-248-53.sslip.io/csapi-go-upstream/` started failing with
  `HTTP 400` and a server-side message indicating that the request body did not
  validate as `application/geo+json`.

The acute failure triggered the investigation. The latent loss was
already real; it had simply been silent.

## 3. Root cause

CSAPI Part 1 (OGC 23-001) defines two distinct request encodings for
procedures, systems, and deployments:

| Encoding                     | Carries                                                                                          |
|------------------------------|--------------------------------------------------------------------------------------------------|
| `application/geo+json`       | Spatial-discovery view: `uid`, `name`, `description`, `geometry`, `featureType`, `validTime`, link properties. **No** SensorML metadata. |
| `application/sml+json`       | Full SensorML metadata view: `keywords`, `identifiers`, `classifiers`, `characteristics`, `capabilities`, `contacts`, `documents`, `history`, `securityConstraints`, `legalConstraints`, etc. |

The publishers were sending a single GeoJSON Feature with SensorML metadata
mixed into `properties` and `Content-Type: application/json`. On the
procedures, deployments, and systems endpoints, the Go server interprets
`application/json` as `application/geo+json` and drops the SensorML-only
properties; systems were only partially affected because `ensure_system`
already followed its POST with an `application/sml+json` PUT. Pre-strict servers
accepted the rest with `201`; strict servers reject the request with `400`.

The `ensure_system` helper had been updated earlier in the project to POST a
stub and then PUT an `application/sml+json` body. That code path was correct.
`ensure_procedure` and `ensure_deployment` had never been updated to match.

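To make the loss mechanism concrete, the decoding behaviour described above can be modelled offline. This is a toy sketch, not the `connected-systems-go` code: the accepted property names are taken from the `application/geo+json` row of the table, link properties are left out for brevity, and the status codes mirror the lenient (`201`) and strict (`400`) behaviour reported in this section.

```python
"""Toy model of a server that decodes every body as the geo+json view.

Illustrative only: the accepted names mirror the geo+json row of the
encoding table; link properties are omitted from the model.
"""

GEOJSON_VIEW_PROPERTIES = frozenset(
    {"uid", "name", "description", "featureType", "validTime"}
)


def decode_as_geojson(feature: dict, *, strict: bool = False) -> tuple[int, dict, list[str]]:
    """Return (http_status, kept_properties, dropped_keys) for a Feature body."""
    kept: dict = {}
    dropped: list[str] = []
    for key, value in feature.get("properties", {}).items():
        if key in GEOJSON_VIEW_PROPERTIES:
            kept[key] = value
        else:
            dropped.append(key)
    dropped.sort()
    if strict and dropped:
        # Strict decoder: refuse the body rather than discard fields silently.
        return 400, {}, dropped
    # Lenient decoder: accept the recognised fields and silently discard the rest.
    return 201, kept, dropped


if __name__ == "__main__":
    body = {
        "type": "Feature",
        "properties": {"uid": "urn:example:proc:1", "name": "P", "keywords": ["x"]},
    }
    print(decode_as_geojson(body))               # lenient: 201, "keywords" dropped
    print(decode_as_geojson(body, strict=True))  # strict: 400
```

The lenient branch is the dangerous one: it reports success while the `dropped` list never reaches the caller.
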
## 4. Why it stayed hidden so long

* **No round-trip test.** No test in this repo POSTed a SensorML field and
  read it back with a GET. A bootstrap that returned an ID was treated as success.
* **Lenient server.** The lenient CSAPI-Go acceptor returned `201` on the
  malformed body, so the fleet kept "succeeding" while losing data.
* **The mixed-encoding body shape was syntactically legal.** A Feature with
  extra keys under `properties` is valid GeoJSON; the loss happens at the
  semantic layer, not the parsing layer.
* **The `ensure_system` two-step pattern was the only correct example,
  and it was treated as system-specific** rather than generalised across
  procedures and deployments.

## 5. Evidence

### 5.1 Pre-fix database audit (2026-04-29)

Audit run against the lenient `connected-systems-go-db-1` and the strict
`csapi-head-db-1`:

| Resource     | Records | Records with any SML metadata column populated |
|--------------|--------:|-----------------------------------------------:|
| procedures   |      12 |                                              0 |
| deployments  |      62 |                                              0 |
| systems      |      38 |                                             34 |

Procedures and deployments lost **100%** of their SensorML metadata. Systems
kept SensorML metadata on ~89% of records (34 of 38); the remaining 4 were edge
cases where the publisher did not yet supply an SML body. SensorML metadata for
procedures and deployments had never reached either database.

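The retention figures follow directly from the audit counts; a quick recomputation using only the numbers in the table:

```python
# Counts copied from the 2026-04-29 audit table: (records, records with SML metadata).
audit = {
    "procedures": (12, 0),
    "deployments": (62, 0),
    "systems": (38, 34),
}

for resource, (total, with_sml) in audit.items():
    retained = 100 * with_sml / total
    print(f"{resource:<12} {with_sml:>3}/{total:<3} {retained:5.1f}% of records kept SML metadata")
```

This yields 0.0% for procedures and deployments and 89.5% for systems (34/38), the figure summarised as ~89% above.
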
### 5.2 Strict-server reproducer (pre-fix)

```
POST /csapi-go-upstream/procedures
Content-Type: application/json

{ "type":"Feature","properties":{ "uid":"...","keywords":["x"], ... } }

→ HTTP 400 Bad Request: body does not validate as application/geo+json
```

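For a local reproduction, the same request can be rebuilt with the Python standard library. This is a sketch under assumptions: the base URL is the strict upstream from Section 2, the `uid` is a hypothetical placeholder, and HTTP Basic auth with the `OS4CSAPI_TEST_USER` / `OS4CSAPI_TEST_PASS` credentials is assumed rather than confirmed. Only `send()` touches the network; against a lenient server the request would succeed and create a real procedure, so point it at a disposable instance.

```python
import base64
import json
import urllib.error
import urllib.request

# Strict upstream from Section 2.
BASE_URL = "https://129-80-248-53.sslip.io/csapi-go-upstream/"


def build_buggy_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Rebuild the pre-fix shape: SensorML keys inside a Feature, sent as application/json."""
    body = {
        "type": "Feature",
        # Hypothetical uid; "keywords" is the SensorML-only field that triggers the 400.
        "properties": {"uid": "urn:example:repro:procedure", "keywords": ["x"]},
    }
    # Assumption: the test server accepts HTTP Basic auth.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/procedures",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


def send(req: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code, including error statuses."""
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling `send(build_buggy_request(BASE_URL, user, password))` against the strict upstream is expected to return `400`, matching the transcript above.
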
### 5.3 Roundtrip integration test (post-fix)

`tests/test_bootstrap_roundtrip.py` POSTs a fresh procedure and deployment
with marker keywords, GETs both back as `application/sml+json`, and asserts
that each marker keyword survives. Offline guardrail tests pass on every commit;
network tests run in CI when `OS4CSAPI_TEST_BASE_URL`, `OS4CSAPI_TEST_USER`, and
`OS4CSAPI_TEST_PASS` are set.

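The gating and assertion logic of such a roundtrip test can be sketched with the standard-library `unittest` module, whose `TestCase` classes pytest also collects. The three environment variable names come from the paragraph above; `create_with_marker` and `fetch_sml` are hypothetical stand-ins that raise `NotImplementedError` until wired to the real helpers, so this shows the structure of the check rather than the contents of `tests/test_bootstrap_roundtrip.py`.

```python
import os
import unittest
import uuid

REQUIRED_ENV = ("OS4CSAPI_TEST_BASE_URL", "OS4CSAPI_TEST_USER", "OS4CSAPI_TEST_PASS")


def network_configured() -> bool:
    """True only when all three CI variables are set and non-empty."""
    return all(os.environ.get(name) for name in REQUIRED_ENV)


def missing_markers(sml_doc: dict, markers: list[str]) -> list[str]:
    """Return the marker keywords that did not survive the roundtrip."""
    present = set(sml_doc.get("keywords", []))
    return [m for m in markers if m not in present]


def create_with_marker(collection: str, marker: str) -> str:
    """Hypothetical stand-in: POST stub + PUT SML with `marker` in keywords; return the id."""
    raise NotImplementedError("wire this to the real bootstrap helpers")


def fetch_sml(collection: str, resource_id: str) -> dict:
    """Hypothetical stand-in: GET {collection}/{resource_id} as application/sml+json."""
    raise NotImplementedError("wire this to the real HTTP client")


@unittest.skipUnless(network_configured(), "needs OS4CSAPI_TEST_* environment variables")
class BootstrapRoundtrip(unittest.TestCase):
    def test_marker_keyword_survives(self):
        for collection in ("procedures", "deployments"):
            with self.subTest(collection=collection):
                marker = f"roundtrip-{uuid.uuid4().hex[:8]}"  # unique per run
                resource_id = create_with_marker(collection, marker)
                sml_doc = fetch_sml(collection, resource_id)
                self.assertEqual(missing_markers(sml_doc, [marker]), [])
```

Generating a fresh marker per run keeps a stale record from an earlier run from satisfying the assertion.
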
## 6. The fix

### 6.1 Helper refactor: `publishers/bootstrap_helpers.py`

`ensure_procedure` and `ensure_deployment` now mirror `ensure_system`:

```python
def ensure_procedure(base_url, auth, uid, stub_body, sml_body=None,
                     *, dry_run=False, stats=None, force_sml=False):
    # Guardrail: flag SensorML-only fields that leaked into the geo+json stub.
    _warn_if_sml_fields_in_stub(stub_body, f"ensure_procedure({uid})")
    ...  # elided in this excerpt
    # Step 1: POST the slim stub to create the record and obtain its id.
    new_id = api_post(base_url, "procedures", stub_body, auth)["id"]
    # Step 2: PUT the full SensorML body under the SensorML JSON encoding.
    if sml_body:
        api_put(base_url, f"procedures/{new_id}", sml_body, auth,
                content_type="application/sml+json")
    return new_id
```

`ensure_deployment` is identical, except that the existing `parent_id`
subdeployment path is preserved for the POST step; the SML PUT always targets
the canonical `deployments/{new_id}` path.

`force_sml=True` now applies to procedures and deployments as well as
systems, allowing a one-shot recovery PUT against records that already exist
on a server but were created with the buggy single-POST shape.

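The control flow described above can be exercised offline by injecting the transport. In this sketch, `ensure_sketch`, `find_id`, `post`, and `put` are hypothetical stand-ins for the module's uid lookup, `api_post`, and `api_put`; the `deployments/{parent_id}/subdeployments` POST path is an assumption based on the CSAPI subdeployment resources, and leaving an existing record untouched unless `force_sml` is set is likewise assumed.

```python
from typing import Callable, Optional

SML_JSON = "application/sml+json"


def ensure_sketch(
    collection: str,
    uid: str,
    stub_body: dict,
    sml_body: Optional[dict] = None,
    *,
    find_id: Callable[[str, str], Optional[str]],
    post: Callable[[str, dict], str],
    put: Callable[[str, dict, str], None],
    parent_id: Optional[str] = None,
    force_sml: bool = False,
) -> str:
    """Stub-then-SML create, or an SML-only repair of an existing record."""
    existing = find_id(collection, uid)
    if existing is not None:
        # Recovery path: the record's identity is kept; only its SML view is re-sent.
        if force_sml and sml_body:
            put(f"{collection}/{existing}", sml_body, SML_JSON)
        return existing

    # Step 1: POST the slim geo+json stub. Child deployments are POSTed under
    # their parent (assumed CSAPI subdeployments path).
    post_path = collection
    if collection == "deployments" and parent_id:
        post_path = f"deployments/{parent_id}/subdeployments"
    new_id = post(post_path, stub_body)

    # Step 2: PUT the full SensorML body at the canonical resource path.
    if sml_body:
        put(f"{collection}/{new_id}", sml_body, SML_JSON)
    return new_id


if __name__ == "__main__":
    calls = []
    new_id = ensure_sketch(
        "deployments", "urn:example:deployment:child",
        {"type": "Feature", "properties": {"uid": "urn:example:deployment:child"}},
        {"type": "Deployment", "keywords": ["example"]},
        find_id=lambda collection, uid: None,  # nothing exists yet
        post=lambda path, body: calls.append(("POST", path)) or "d42",
        put=lambda path, body, ct: calls.append(("PUT", path, ct)),
        parent_id="root",
    )
    print(new_id, calls)
```

With the recording fakes in the `__main__` block, the calls come out as a POST to `deployments/root/subdeployments` followed by a PUT of the SML body to `deployments/d42` with `application/sml+json`. Injecting the transport keeps the two-step ordering testable without a server.
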
### 6.2 Encoding-contract guardrail

A new module-level helper, `_warn_if_sml_fields_in_stub(stub, label)`, scans the
stub's `properties` for any of a closed set of SensorML-only field names
(`SML_ONLY_FIELDS`). On a match it emits a `[WARN] [ENCODING-CONTRACT] …`
line; if `OS4CSAPI_STRICT_BOOTSTRAP=1` is set, it raises `RuntimeError`
instead. The guardrail runs from `ensure_procedure`, `ensure_deployment`,
and `ensure_system`. Tests and CI should set `OS4CSAPI_STRICT_BOOTSTRAP=1`.

### 6.3 NWS canonical refactor: `publishers/nws/bootstrap_nws.py`

* `PROCEDURE_BODY` (a single mixed-encoding dict) → split into
  `_procedure_stub()` (geo+json: `uid`, `name`, `description`, `featureType`,
  `validTime`) + `_procedure_sml()` (SensorML JSON encoding: type
  `SimpleProcess`, `uniqueId`, `label`, `keywords`, `identifiers`,
  `classifiers`, `contacts` with `organisationName` and `contactInfo`,
  `documents` with `link.href`, and `characteristics` carrying lineage and
  usage constraints).
* `_deploy_root()` and `_deploy_group()` had their `documentation` arrays
  removed and gained matching `_deploy_root_sml()` / `_deploy_group_sml()`
  companions, each returning a SensorML `Deployment` document with `documents`
  and (for the group) `keywords`.
* `_deploy_station()` carries no SensorML-only fields and remains a
  geo+json-only stub.
* `bootstrap()` call sites were updated to pass both bodies and to forward
  `force_sml=force_sml`, so `--force-sml` now repairs procedures and
  deployments in place.

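An illustrative pair of body builders with the split described for `_procedure_stub()` and `_procedure_sml()`. Every value (UID, names, URLs, the `featureType` IRI, the `validTime` range) is hypothetical, and the SensorML JSON nesting is simplified; the functions are named `procedure_stub` / `procedure_sml` to make clear they are not copies of `bootstrap_nws.py`.

```python
PROCEDURE_UID = "urn:example:nws:procedure:observation"  # hypothetical UID


def procedure_stub() -> dict:
    """geo+json view: identity and discovery fields only."""
    return {
        "type": "Feature",
        "geometry": None,  # procedures carry no location
        "properties": {
            "uid": PROCEDURE_UID,
            "name": "Example NWS observation procedure",
            "description": "Illustrative stub body",
            "featureType": "http://www.w3.org/ns/sosa/Procedure",  # assumed IRI
            "validTime": ["2026-01-01T00:00:00Z", ".."],
        },
    }


def procedure_sml() -> dict:
    """sml+json view (simplified): the SensorML metadata the stub must not carry."""
    return {
        "type": "SimpleProcess",
        "uniqueId": PROCEDURE_UID,
        "label": "Example NWS observation procedure",
        "keywords": ["weather", "observation"],
        "identifiers": [{"label": "Short name", "value": "nws-obs"}],
        "classifiers": [{"label": "Domain", "value": "meteorology"}],
        "contacts": [{
            "organisationName": "Example Organisation",
            "contactInfo": {"website": "https://example.org"},
        }],
        "documents": [{"name": "Reference", "link": {"href": "https://example.org/doc"}}],
        "characteristics": [{
            "label": "Constraints",
            "characteristics": [
                {"type": "Text", "name": "lineage", "value": "Example lineage statement"},
                {"type": "Text", "name": "usageConstraints", "value": "Example usage terms"},
            ],
        }],
    }
```

Deriving `uid` and `uniqueId` from one constant keeps the two views of the record from drifting apart.
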
## 7. Verification

| Layer                              | Method                                              | Status |
|------------------------------------|-----------------------------------------------------|:------:|
| Helper signatures                  | `python -c "import publishers.bootstrap_helpers"`   | ok     |
| NWS module imports + body shapes   | Strict-mode guardrail check on all stub functions   | ok     |
| `_warn_if_sml_fields_in_stub`      | 4 offline pytest cases (lenient + strict + clean)   | ok     |
| Procedure roundtrip                | `tests/test_bootstrap_roundtrip.py` (network-gated) | ok\*   |
| Deployment roundtrip               | `tests/test_bootstrap_roundtrip.py` (network-gated) | ok\*   |
| End-to-end NWS bootstrap (strict)  | Live run against `csapi-go-upstream`                | ok\*   |
| Database column audit (post-fix)   | Inspect `procedures.keywords`, `deployments.keywords`, etc. on Oracle VM | ok\* |

\* Run as part of the smoke-test step (Section 8).

## 8. Recovery operations

For environments that already received the buggy payloads, the same publisher
can be re-run with `--force-sml`:

```
python -m publishers.nws.bootstrap_nws --force-sml
```

Per the new helpers, `--force-sml`:

* finds the existing `procedure` / `deployment` by `uid`,
* PUTs the (now correct) SensorML body against
  `procedures/{id}` / `deployments/{id}` with
  `Content-Type: application/sml+json`,
* leaves the record's identity (id, links, datastreams) untouched.

This restores the SensorML metadata of previously bootstrapped resources
without forcing a clean-and-rebuild. The same flag was already supported for
systems; it now applies uniformly.

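One plausible way for the flag to reach the helpers, sketched with `argparse`. The parser and the `bootstrap(force_sml=...)` signature here are assumptions; Section 6.3 only states that `bootstrap()` forwards `force_sml=force_sml`.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="python -m publishers.nws.bootstrap_nws")
    parser.add_argument(
        "--force-sml",
        action="store_true",
        help="re-PUT the SensorML body (application/sml+json) onto records that already exist",
    )
    return parser.parse_args(argv)


def bootstrap(*, force_sml: bool = False) -> None:
    """Hypothetical stand-in for the real entry point, which forwards force_sml to ensure_*."""
    print(f"bootstrap(force_sml={force_sml})")


def main(argv=None) -> None:
    args = parse_args(argv)
    bootstrap(force_sml=args.force_sml)
```

`argparse` converts the dash in `--force-sml` to an underscore, so the parsed value is available as `args.force_sml`.
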
## 9. Lessons and guardrails

1. **Treat encoding boundaries as data-integrity boundaries.** In CSAPI,
   `application/geo+json` and `application/sml+json` are not interchangeable
   request shapes: the geo+json view carries none of the SensorML metadata, and
   the server is permitted to drop fields that don't belong to the chosen view.
   Any helper that POSTs against a CSAPI resource must encode this contract
   explicitly.
2. **Always round-trip a marker field in tests.** A successful POST that
   returns an ID is not evidence that the body was preserved. The new
   `tests/test_bootstrap_roundtrip.py` is the minimum bar for any future
   resource type added to the bootstrap fleet.
3. **Add a closed-set linter, not freeform validation.** `SML_ONLY_FIELDS`
   is small, finite, and lives next to the helpers. The `_warn_if_sml_fields_in_stub`
   call is a set-membership scan with negligible runtime cost, and it catches
   this entire class of bug.
4. **Make strict mode a one-line opt-in.** `OS4CSAPI_STRICT_BOOTSTRAP=1`
   turns the warning into an exception. Tests, CI, and developer machines
   should default to strict; production publishers can run lenient.
5. **Generalise correct patterns; don't isolate them.** `ensure_system` had
   the right shape for over a year. The fix here is, at its core, "do
   the same thing for the other two resources." Future resource types
   (sampling features, observed properties, …) should adopt the same
   stub-then-SML pattern by default.

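For lesson 4, one low-friction way to make strict mode the default under pytest is a module-level setting in a `conftest.py`. This is a sketch that assumes the guardrail reads `OS4CSAPI_STRICT_BOOTSTRAP` at call time; the file location is hypothetical.

```python
# conftest.py (hypothetical location: repository root)
import os

# pytest imports conftest.py before the test modules beside it, so every
# ensure_* call made during the session sees strict mode by default.
os.environ.setdefault("OS4CSAPI_STRICT_BOOTSTRAP", "1")
```

Because it uses `setdefault`, a developer who explicitly exports a different value (for example `OS4CSAPI_STRICT_BOOTSTRAP=0`) keeps it.
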
## 10. Cross-references

* Issue: `OS4CSAPI/OSHConnect-Python#5` (`[P1] ensure_procedure and
  ensure_deployment silently lose all SensorML metadata`).
* Disposition plan: `docs/governance/plan-report-13-disposition.md`
  (in the OS4CSAPI workspace).
* Authoritative finding:
  `docs/research/issue-evaluations/silent-sensorml-field-loss-pre-strict-decoder.md`
  (in the OS4CSAPI workspace).
* Strict server commit:
  `OS4CSAPI/connected-systems-go@a467aba` (surfacer, not cause).
* Reference two-step implementation: `ensure_system` in
  `publishers/bootstrap_helpers.py` (predates this report).

## 11. Timeline

| Date       | Event                                                                                |
|------------|--------------------------------------------------------------------------------------|
| 2026-04-17 | Strict CSAPI-Go upstream stood up; `csapi-go-upstream` starts rejecting bootstraps.  |
| 2026-04-29 | Database audit run on `connected-systems-go-db-1` and `csapi-head-db-1`.             |
| 2026-05-02 | `OS4CSAPI/OSHConnect-Python#5` filed.                                                |
| 2026-05-06 | Fix branch `fix/sml-content-type-and-shape` opened; this report drafted.             |

---

*This report is intended to be a stable artefact. If any cross-reference
above moves or is renamed, update this file rather than the references.*