
Commit b1398ed

Author: iPythoning
feat(whatsapp-onboarding): v0.3 — Layer B miner + MemOS upsert + Layer C embed
Closes the remaining gap between a spec/SOP and an end-to-end delivery
pipeline. Bootstrap.sh now runs:

  exports/*.txt
    -> parsed/*.jsonl
    -> profiles/*.yaml (Layer A, MemOS-ready)
    -> golden/*.yaml (Layer B, awaiting human review)
    -> layer-c-chunks.jsonl (Layer C, KB-ready)
    -> optional push to PulseAgent

Added scripts:
- mine-golden-segments.py: keyword + Haiku two-pass Layer B miner with
  five tag classes (deal_close / objection_resolved / dunning_recovered /
  relationship_warmup / cross_sell) across EN / ZH / ES signals.
- memos-upsert.py: profile push to PA /api/memos/upsert; honors
  _auto_onboard gate, supports --force.
- bulk-embed.py: chunks parsed/*.jsonl into KB records with strict
  customer_hash isolation metadata. Two modes: emit JSONL or --upload.
  Embedding stays on the PA backend by design (no local heavy ML deps).

Fixes:
- export parser MEDIA regex now eats outer <> on omitted-media markers.
- bulk-embed dropped redundant media tag in chunked text.

Config:
- All three PA-talking scripts honor CLI > pa-config.json > env vars.
- samples/pa-config.example.json shows the expected schema.

Self-tested end-to-end on synthetic iOS+Android exports.
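
The commit message describes pass 1 of the Layer B miner only as keyword detection over five tag classes, and mine-golden-segments.py itself is not shown on this page. The block below is a minimal sketch of that idea, not the script's code. The keyword lists, the window and step sizes, and the name `keyword_pass` are all hypothetical; only the five tag names come from the commit message.

```python
# Hypothetical signal keywords (EN / ZH / ES); the real lists are in
# mine-golden-segments.py, which is not reproduced here.
SIGNALS = {
    "deal_close": ["deal", "成交", "trato hecho"],
    "objection_resolved": ["too expensive", "太贵", "muy caro"],
    "dunning_recovered": ["overdue", "逾期", "pago pendiente"],
    "relationship_warmup": ["happy birthday", "生日快乐", "feliz cumpleaños"],
    "cross_sell": ["also need", "还需要", "también necesito"],
}


def keyword_pass(turns, window=6, step=3):
    """Slide a window over the turns and yield (start, tag, hits) for the
    best-matching tag of every window that has at least one keyword hit."""
    start = 0
    while start < len(turns):
        text = " ".join(t["text"].lower() for t in turns[start:start + window])
        # Count plain substring occurrences per tag class.
        hits = {tag: sum(text.count(w) for w in words) for tag, words in SIGNALS.items()}
        tag, count = max(hits.items(), key=lambda kv: kv[1])
        if count > 0:
            yield start, tag, count
        if start + window >= len(turns):
            break
        start += step


sample = [{"text": "Hello"}, {"text": "That is too expensive"}, {"text": "ok"}]
print(list(keyword_pass(sample)))
```

Substring counting is deliberately crude (it would match "deal" inside "ideal"); in the described design, the second LLM pass is what rescoring and retagging are for.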
1 parent d0b5c5e commit b1398ed

8 files changed

Lines changed: 684 additions & 7 deletions

File tree

CHANGELOG.md

Lines changed: 34 additions & 0 deletions
@@ -8,6 +8,40 @@ Changes sourced from upstream (openclaw/openclaw) are labeled with the originati
 
 ## [Unreleased]
 
+## 2026-05-21 — WhatsApp Onboarding Spec v0.3 (Layer B + Layer C scripts)
+
+Closes the remaining gap to a true end-to-end delivery: Layer B miner,
+Layer A pusher, and Layer C chunk uploader. `bootstrap.sh` now wires
+them all together so a complete run produces:
+
+- `profiles/` → MemOS-ready customer YAMLs
+- `golden/` → Layer B segments awaiting human review
+- `layer-c-chunks.jsonl` → conversation history chunks for KB import
+
+### Added
+
+- **scripts/mine-golden-segments.py** — Two-pass Layer B miner:
+  pass 1 sliding-window keyword detection (EN/ZH/ES signals across five
+  tag classes), pass 2 Haiku LLM scoring + retag + tactical-move
+  extraction. Drops segments scoring < 3.
+- **scripts/memos-upsert.py** — Pushes `profiles/*.yaml` to PulseAgent
+  MemOS endpoint. Honors `_auto_onboard` gate; supports `--force`.
+  Config sources: CLI > pa-config.json > env vars.
+- **scripts/bulk-embed.py** — Chunks `parsed/*.jsonl` into KB-ready
+  records with strict `customer_hash` metadata. Two modes:
+  emit JSONL for offline import, or `--upload` to push directly to
+  `/api/kb/upsert`. Embedding stays on the PA backend by design.
+- **samples/pa-config.example.json** — Reference shape for
+  `~/.pa-config.json`.
+- **bootstrap.sh** wires mining + Layer C chunking + optional push step
+  into the standard delivery flow.
+
+### Fixed
+
+- `whatsapp-export-parser.py` MEDIA regex now strips outer `<>` around
+  `<image omitted>` / `<Media omitted>` so chunked text is clean.
+- `bulk-embed.py` removed duplicated media tag in chunk text body.
+
 ## 2026-05-21 — WhatsApp Onboarding Spec v0.2 (customer delivery kit)
 
 Turns the v0.1 spec into something a delivery engineer can actually run on
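
The "Fixed" entry above says the parser's MEDIA regex now strips the outer `<>` around omitted-media markers, but whatsapp-export-parser.py is not part of this page. As a rough illustration only, a pattern with optional angle brackets could look like the sketch below. The pattern, the `[media]` placeholder, and the name `strip_media` are assumptions, and only the two marker spellings quoted in the changelog are covered.

```python
import re

# Hypothetical pattern: optional "<", the word "image" or "Media",
# the word "omitted", then an optional ">".
MEDIA_RE = re.compile(r"<?\b(?:image|media)\s+omitted\b>?", re.IGNORECASE)


def strip_media(text: str, placeholder: str = "[media]") -> str:
    """Replace an omitted-media marker, including its angle brackets."""
    return MEDIA_RE.sub(placeholder, text).strip()


print(strip_media("see <image omitted> here"))
print(strip_media("<Media omitted>"))
```

Making both brackets optional means a marker without brackets is handled the same way, so no stray `<` or `>` ends up in the chunk text.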

whatsapp-old-account-onboarding/docs/README.md

Lines changed: 10 additions & 2 deletions
@@ -226,15 +226,23 @@ Any metric 2× baseline → pause expansion, audit prompt + samples.
 ```
 whatsapp-old-account-onboarding/
 ├── scripts/
+│   ├── bootstrap.sh                    ← one-command entry
 │   ├── whatsapp-export-parser.py
-│   └── customer-profile-extractor.py
+│   ├── customer-profile-extractor.py
+│   ├── mine-golden-segments.py         ← Layer B miner
+│   ├── memos-upsert.py                 ← Layer A push
+│   ├── bulk-embed.py                   ← Layer C chunk + push
+│   └── requirements.txt
 ├── docs/
 │   ├── README.md                       ← you are here
 │   ├── README.zh-CN.md
+│   ├── CUSTOMER-DELIVERY-GUIDE.md
+│   ├── CUSTOMER-DELIVERY-GUIDE.zh-CN.md
 │   ├── OpenClaw-knowledge-base-import.md
 │   └── system-prompt-template.md
 └── samples/
-    └── example-customer-profile.yaml
+    ├── example-customer-profile.yaml
+    └── pa-config.example.json          ← copy to ~/.pa-config.json
 ```
 
 ---

whatsapp-old-account-onboarding/samples/pa-config.example.json

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+  "_comment": "Copy to ~/.pa-config.json and fill in. Read by memos-upsert.py and bulk-embed.py.",
+  "endpoint": "https://your-pulseagent-host.example.com",
+  "token": "Bearer-token-from-PA-settings",
+  "tenant": "your-tenant-slug"
+}
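
To make the role of these three keys concrete, here is a small sketch of how they could be resolved from several sources. It follows `load_config` in bulk-embed.py as shown on this page: file values first, then `PA_*` environment variables, then CLI flags. In that code the environment overrides the file, which is not the "CLI > pa-config.json > env vars" order stated in the changelog. The function name `resolve_config` and the sample values are hypothetical. Unlike the original, the sketch keeps only the three known keys and drops extras such as `_comment`.

```python
import json

KEYS = ("endpoint", "token", "tenant")


def resolve_config(file_cfg: dict, env: dict, cli: dict) -> dict:
    """Merge file, environment, and CLI values; later sources win."""
    cfg = {k: v for k, v in file_cfg.items() if k in KEYS}
    for key in KEYS:
        env_value = env.get(f"PA_{key.upper()}")
        if env_value:
            cfg[key] = env_value
        if cli.get(key):
            cfg[key] = cli[key]
    return cfg


# Illustrative inputs only; none of these hosts or tokens are real.
file_cfg = json.loads(
    '{"_comment": "x", "endpoint": "https://file.example.com",'
    ' "token": "t-file", "tenant": "acme"}'
)
env = {"PA_ENDPOINT": "https://env.example.com"}
cli = {"token": "t-cli"}
print(resolve_config(file_cfg, env, cli))
```

Passing the environment and CLI values in as plain dictionaries keeps the merge logic testable without touching `os.environ` or `argparse`.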

whatsapp-old-account-onboarding/scripts/bootstrap.sh

Lines changed: 43 additions & 3 deletions
@@ -214,6 +214,44 @@ else
     --parsed "${PROJECT_DIR}/parsed" \
     --output "${PROJECT_DIR}/profiles" \
     --min-turns 20 2>&1 | tee -a "${LOG_FILE}"
+
+  say ""
+  info "Mining Layer B golden segments (sales playbook)..."
+  python3 "${SCRIPT_DIR}/mine-golden-segments.py" \
+    --parsed "${PROJECT_DIR}/parsed" \
+    --output "${PROJECT_DIR}/golden" \
+    --min-score 3 2>&1 | tee -a "${LOG_FILE}"
+
+  say ""
+  info "Chunking Layer C conversation history..."
+  python3 "${SCRIPT_DIR}/bulk-embed.py" \
+    --parsed "${PROJECT_DIR}/parsed" \
+    --output "${PROJECT_DIR}/layer-c-chunks.jsonl" 2>&1 | tee -a "${LOG_FILE}"
+fi
+
+# ---- Step 5b: optional push to PulseAgent -----------------------------------
+
+if [[ "${DELIVERY_PATH}" != "A" ]]; then
+  say ""
+  ask_choice "Push to PulseAgent now?" PUSH_NOW \
+    "No, I'll review and push later" \
+    "Yes, push profiles to MemOS + chunks to KB"
+
+  if [[ "${PUSH_NOW}" == Yes* ]]; then
+    if [[ ! -f "${HOME}/.pa-config.json" && -z "${PA_ENDPOINT:-}" ]]; then
+      warn "No ~/.pa-config.json or PA_ENDPOINT env var found."
+      ask "PulseAgent endpoint URL (e.g. https://pa.example.com)" PA_ENDPOINT
+      ask "PulseAgent API token (Bearer)" PA_TOKEN
+      ask "Tenant slug" PA_TENANT
+      export PA_ENDPOINT PA_TOKEN PA_TENANT
+    fi
+    say ""
+    info "Upserting profiles to MemOS..."
+    python3 "${SCRIPT_DIR}/memos-upsert.py" --profiles "${PROJECT_DIR}/profiles" 2>&1 | tee -a "${LOG_FILE}"
+    say ""
+    info "Uploading conversation chunks to KB..."
+    python3 "${SCRIPT_DIR}/bulk-embed.py" --parsed "${PROJECT_DIR}/parsed" --upload 2>&1 | tee -a "${LOG_FILE}"
+  fi
 fi
 
 # ---- Step 6: verification report --------------------------------------------
@@ -239,9 +277,11 @@ say "============================================================"
 ok "Bootstrap complete."
 say ""
 say "Next steps:"
-say " 1. Open ${PROJECT_DIR}/profiles/_manual_review.txt and triage."
-say " 2. Push approved profiles to MemOS (see docs/README.md Step 4)."
-say " 3. Run Layer B segment mining (docs/OpenClaw-knowledge-base-import.md)."
+say " 1. Open ${PROJECT_DIR}/profiles/_manual_review.txt and triage gated customers."
+say " 2. Manually audit ${PROJECT_DIR}/golden/*.yaml — set _human_reviewed: true on keepers."
+say " 3. If you skipped the push step, run:"
+say " python3 scripts/memos-upsert.py --profiles profiles"
+say " python3 scripts/bulk-embed.py --parsed parsed --upload"
 say " 4. Configure system prompt (docs/system-prompt-template.md)."
 say " 5. Run the 5 pre-launch verification cases before going live."
 say ""

whatsapp-old-account-onboarding/scripts/bulk-embed.py

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
+#!/usr/bin/env python3
+"""
+Bulk Embed — Layer C upload
+
+Chunks ./parsed/<customer_hash>.jsonl into KB-ready records and uploads to
+PulseAgent / OpenClaw Knowledge Base `conversation_history` collection.
+
+By design we DO NOT compute embeddings locally — the PulseAgent KB backend
+owns embedding model selection so all customers stay consistent. This script
+just slices conversation turns into chunks + metadata.
+
+Two output modes:
+  --upload     POST each chunk to PA `/api/kb/upsert` endpoint
+  (default)    Emit a single ./layer-c-chunks.jsonl for offline import
+
+Chunk schema (per record):
+  {
+    "collection": "conversation_history",
+    "customer_hash": "<16-hex>",          <-- isolation key
+    "session_id": "<hash>-<isoTs>",
+    "chunk_id": "<hash>-c0042",
+    "ts_start": "ISO timestamp",
+    "ts_end": "ISO timestamp",
+    "turn_count": 8,
+    "text": "[me] ...\n[customer] ...\n..."
+  }
+
+Config sources mirror memos-upsert.py (pa-config.json / env vars / CLI).
+
+Usage:
+  python bulk-embed.py --parsed ./parsed                  # emit JSONL
+  python bulk-embed.py --parsed ./parsed --upload         # push to PA
+  python bulk-embed.py --parsed ./parsed --chunk-size 12  # larger chunks
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+from dataclasses import dataclass, asdict
+from pathlib import Path
+from typing import Iterable
+from urllib import error, request
+
+
+@dataclass
+class Chunk:
+    collection: str
+    customer_hash: str
+    session_id: str
+    chunk_id: str
+    ts_start: str
+    ts_end: str
+    turn_count: int
+    text: str
+
+
+def load_turns(jsonl_path: Path) -> list[dict]:
+    turns: list[dict] = []
+    with jsonl_path.open(encoding="utf-8") as fp:
+        for line in fp:
+            line = line.strip()
+            if line:
+                turns.append(json.loads(line))
+    return turns
+
+
+def chunk_turns(
+    turns: list[dict], chunk_size: int, overlap: int
+) -> Iterable[Chunk]:
+    if chunk_size <= overlap:
+        raise ValueError("chunk_size must exceed overlap")
+    if not turns:
+        return
+    customer_hash = turns[0]["customer_hash"]
+    step = chunk_size - overlap
+    idx = 0
+    chunk_num = 0
+    while idx < len(turns):
+        window = turns[idx : idx + chunk_size]
+        if not window:
+            break
+        ts_start, ts_end = window[0]["ts"], window[-1]["ts"]
+        text = "\n".join(f"[{t['sender']}] {t['text']}" for t in window)
+        yield Chunk(
+            collection="conversation_history",
+            customer_hash=customer_hash,
+            session_id=window[0].get("session_id", ""),
+            chunk_id=f"{customer_hash}-c{chunk_num:04d}",
+            ts_start=ts_start,
+            ts_end=ts_end,
+            turn_count=len(window),
+            text=text,
+        )
+        chunk_num += 1
+        if idx + chunk_size >= len(turns):
+            break
+        idx += step
+
+
+def load_config(args: argparse.Namespace) -> dict[str, str]:
+    cfg: dict[str, str] = {}
+    for path in (Path.cwd() / "pa-config.json", Path.home() / ".pa-config.json"):
+        if path.is_file():
+            cfg.update(json.loads(path.read_text(encoding="utf-8")))
+            break
+    for key in ("endpoint", "token", "tenant"):
+        env = os.environ.get(f"PA_{key.upper()}")
+        if env:
+            cfg[key] = env
+        cli = getattr(args, key, None)
+        if cli:
+            cfg[key] = cli
+    return cfg
+
+
+def upload_chunk(cfg: dict[str, str], chunk: Chunk) -> tuple[int, str]:
+    url = f"{cfg['endpoint'].rstrip('/')}/api/kb/upsert"
+    payload = json.dumps({"tenant": cfg.get("tenant"), **asdict(chunk)}).encode("utf-8")
+    req = request.Request(
+        url,
+        data=payload,
+        method="POST",
+        headers={
+            "Authorization": f"Bearer {cfg['token']}",
+            "Content-Type": "application/json",
+        },
+    )
+    try:
+        with request.urlopen(req, timeout=30) as resp:
+            return resp.status, resp.read().decode("utf-8", errors="replace")
+    except error.HTTPError as e:
+        return e.code, e.read().decode("utf-8", errors="replace")
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--parsed", type=Path, required=True)
+    ap.add_argument("--output", type=Path, default=Path("./layer-c-chunks.jsonl"))
+    ap.add_argument("--chunk-size", type=int, default=8)
+    ap.add_argument("--chunk-overlap", type=int, default=2)
+    ap.add_argument("--upload", action="store_true", help="POST chunks to PA KB endpoint")
+    ap.add_argument("--endpoint")
+    ap.add_argument("--token")
+    ap.add_argument("--tenant")
+    args = ap.parse_args()
+
+    cfg = load_config(args)
+    if args.upload:
+        missing = [k for k in ("endpoint", "token") if k not in cfg]
+        if missing:
+            sys.exit(f"--upload requires config keys: {missing}. See script docstring.")
+
+    total_chunks = 0
+    total_uploads_ok = 0
+    total_uploads_err = 0
+
+    if not args.upload:
+        out_fp = args.output.open("w", encoding="utf-8")
+    else:
+        out_fp = None
+
+    try:
+        for jsonl in sorted(args.parsed.glob("*.jsonl")):
+            turns = load_turns(jsonl)
+            chunks = list(chunk_turns(turns, args.chunk_size, args.chunk_overlap))
+            total_chunks += len(chunks)
+            print(f"[chunk] {jsonl.stem}: {len(turns)} turns -> {len(chunks)} chunks")
+
+            for c in chunks:
+                if out_fp:
+                    out_fp.write(json.dumps(asdict(c), ensure_ascii=False) + "\n")
+                else:
+                    code, body = upload_chunk(cfg, c)
+                    if 200 <= code < 300:
+                        total_uploads_ok += 1
+                    else:
+                        total_uploads_err += 1
+                        snippet = body[:120].replace("\n", " ")
+                        print(f"[err] {c.chunk_id} -> HTTP {code}: {snippet}")
+    finally:
+        if out_fp:
+            out_fp.close()
+
+    if args.upload:
+        print(f"\nUploaded {total_uploads_ok}/{total_chunks} chunks; errors {total_uploads_err}.")
+    else:
+        print(f"\nWrote {total_chunks} chunks to {args.output}")
+        print("Next: feed to PulseAgent KB importer with collection=conversation_history.")
+
+
+if __name__ == "__main__":
+    main()
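
The overlap arithmetic in `chunk_turns` is easier to see with numbers. With the defaults (`--chunk-size 8`, `--chunk-overlap 2`) the window advances by 6 turns each time, and the loop stops once a window reaches the last turn. The helper below reproduces only that index logic. The name `window_bounds` is hypothetical and it returns `(start, end)` slice bounds rather than `Chunk` records.

```python
def window_bounds(n_turns: int, chunk_size: int = 8, overlap: int = 2) -> list:
    """Return the (start, end) slice bounds that chunk_turns would cover."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    bounds = []
    idx = 0
    while idx < n_turns:
        bounds.append((idx, min(idx + chunk_size, n_turns)))
        # Stop once this window has reached the final turn, so no
        # trailing chunk made only of overlap turns is emitted.
        if idx + chunk_size >= n_turns:
            break
        idx += step
    return bounds


print(window_bounds(20))  # [(0, 8), (6, 14), (12, 20)]
print(window_bounds(9))   # [(0, 8), (6, 9)]
```

For a 20-turn conversation this gives three chunks, each sharing two turns with its neighbour. A 9-turn conversation ends with a short chunk of three turns, two of which repeat from the previous chunk.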
