Document synthetic dataset ingestion workflow #20
base: dev
💡 Codex Review
Here are some automated review suggestions for this pull request.
3. Persist relationships in a similar fashion using `MeshMind.store_triplets` and the generated CSV payload (for example, with `csv.DictReader`).
Document required entity label when loading synthetic triplets
Step 3 describes ingesting `triplets.csv` with `MeshMind.store_triplets`, but the CSV produced by `scripts/generate_synthetic_dataset.py` only contains `subject,predicate,object,namespace,metadata`. `meshmind.core.types.Triplet` also requires an `entity_label` (line 40), so constructing `Triplet(**row)` from the CSV as implied raises a validation error. The guide needs to either add `entity_label` to the generated data or instruct readers which label to inject before calling `store_triplets`.
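The second option (injecting a label before ingestion) could look roughly like the sketch below. It is illustrative only: `TripletRow` is a hypothetical stand-in for `meshmind.core.types.Triplet`, whose real fields and validation are not shown in this thread, and both `load_triplets` and the `"Memory"` default label are assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TripletRow:
    """Hypothetical stand-in for meshmind.core.types.Triplet (assumed fields)."""
    subject: str
    predicate: str
    object: str
    namespace: str
    entity_label: str
    metadata: str = ""


def load_triplets(handle, default_label):
    """Read triplet rows from CSV, filling in entity_label when it is absent."""
    rows = []
    for row in csv.DictReader(handle):
        # Only inject the label when the column is missing or empty,
        # so files that already carry entity_label pass through unchanged.
        if not row.get("entity_label"):
            row["entity_label"] = default_label
        rows.append(TripletRow(**row))
    return rows


# A CSV in the old five-column layout described in the review comment.
sample = io.StringIO(
    "subject,predicate,object,namespace,metadata\n"
    "a,relates_to,b,demo,{}\n"
)
triplets = load_triplets(sample, default_label="Memory")
```

With the real model, the resulting objects would presumably be passed on to `MeshMind.store_triplets`; that call is omitted here because its signature is not shown in this thread.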
@codex take care of this. Check whether `scripts/generate_synthetic_dataset.py` already writes an `entity_label` into every row of the generated .csv file; if it does not, implement it. Make sure the guide and any other references are consistent.
Summary
- Updated `scripts/generate_synthetic_dataset.py` so generated triplet CSV rows carry `entity_label` values aligned with their source memories, ensuring direct construction of `Triplet` models during ingestion. (scripts/generate_synthetic_dataset.py L56-L121)
- Refreshed retrieval, operations, README, and testing documentation to call out the new `entity_label` column in synthetic triplet payloads and keep ingestion guidance accurate. (docs/retrieval.md L99-L101, docs/operations.md L73-L78, README.md L203-L209, docs/testing.md L69-L73)
- Logged the update in the changelog and synchronized planning/tracking artifacts to reflect the revised workflow expectations. (CHANGELOG.md L3-L8, PLAN.md L22-L26, TODO.md L3-L6)
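As a rough illustration of the first bullet, the sketch below writes triplet rows whose `entity_label` is looked up from the source memory. It is not the code from `scripts/generate_synthetic_dataset.py`: the `write_triplets` helper, the `label_by_memory` mapping, the column order, and keying the lookup on `subject` are all assumptions made for the example.

```python
import csv
import io

# Assumed column layout: the original five columns plus entity_label.
FIELDNAMES = ["subject", "predicate", "object", "namespace", "entity_label", "metadata"]


def write_triplets(handle, triplets, label_by_memory):
    """Write triplet dicts as CSV, copying each label from its source memory."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for triplet in triplets:
        row = dict(triplet)
        # Assumption: the subject identifies the memory the triplet came from.
        row["entity_label"] = label_by_memory[triplet["subject"]]
        writer.writerow(row)


buffer = io.StringIO()
write_triplets(
    buffer,
    [{"subject": "m1", "predicate": "relates_to", "object": "m2",
      "namespace": "demo", "metadata": "{}"}],
    {"m1": "Memory", "m2": "Memory"},
)
lines = buffer.getvalue().splitlines()
```

Rows written this way already contain every field the review comment lists, so a reader can construct a model straight from each `csv.DictReader` row without injecting a label first.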
Testing
⚠️ `python scripts/generate_synthetic_dataset.py build/tmp_dataset --memories 5 --triplets 5 --embedding-dim 8 --namespace demo` (fails: `ModuleNotFoundError: No module named 'numpy'`)
https://chatgpt.com/codex/tasks/task_b_68f19ec81f4c83219319d593ba3cdfb7