Summary
This adds a Hermes-backed LoCoMo benchmark runner under benchmark/locomo/hermes for comparing three memory paths:
Hermes native memory baseline
Hermes-to-OpenViking E2E ingestion
OpenViking pre-ingest queried through Hermes
The runner wires import, QA evaluation, judging, and final statistics into one repeatable flow while keeping the LoCoMo dataset and generated benchmark artifacts outside the PR.
Changes
Add run_full_eval.sh to orchestrate suite selection, result directories, retries, optional OpenViking checkpoints, and E2E target/archive readiness checks.
Add importers for native Hermes memory, Hermes/OpenViking E2E session ingestion, and direct OpenViking pre-ingest, using flattened LoCoMo session transcripts with timestamp and visual metadata.
Add shared QA, judge, and stats helpers with retry handling, tool-call accounting, Hermes state.db token/cache summaries, and OpenViking observer token deltas.
Notes
The LoCoMo dataset is intentionally not included. Use LOCOMO_JSON=/path/to/locomo10.json or place a local copy at the documented path.
Benchmark tests and generated result/checkpoint directories are intentionally not included.
state.db is used when available for authoritative Hermes token/cache accounting because gateway CSV token fields can be lossy.
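As an illustration of the dataset note above, here is a minimal sketch of how a helper might resolve the dataset location. It is not taken from the PR: the function name resolve_locomo_path and its default_path and env parameters are hypothetical, and the documented fallback path is deliberately left as a caller-supplied argument rather than guessed.

```python
import os
from pathlib import Path


def resolve_locomo_path(default_path=None, env=None):
    """Return the LoCoMo JSON path, preferring the LOCOMO_JSON override.

    Hypothetical helper: default_path stands in for the documented local
    path, which is not reproduced here.
    """
    env = os.environ if env is None else env
    override = env.get("LOCOMO_JSON")

    if override:
        candidate = Path(override)
    elif default_path:
        candidate = Path(default_path)
    else:
        candidate = None

    # Fail early with an actionable message instead of a late open() error.
    if candidate is None or not candidate.is_file():
        raise FileNotFoundError(
            "LoCoMo dataset not found; set LOCOMO_JSON=/path/to/locomo10.json "
            "or place a local copy at the documented path"
        )
    return candidate
```

Checking for the file before any import or evaluation step starts keeps a missing dataset from surfacing halfway through a long benchmark run.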
Validation
uv run ruff format --check benchmark/locomo/hermes/*.py
uv run ruff check benchmark/locomo/hermes/*.py
bash -n benchmark/locomo/hermes/run_full_eval.sh
uv run python -m py_compile benchmark/locomo/hermes/*.py
./benchmark/locomo/hermes/run_full_eval.sh --help
The --help paths of the Python scripts load for the import, eval, judge, and stats helpers.
Broad except Exception clauses without logging could hide real issues. Add logging (print is acceptable for benchmark scripts) or narrow the exception types.
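A minimal sketch of what this suggestion could look like in practice. The flagged code is not shown in this comment, so the function load_json_record and its label parameter are hypothetical; the point is only the pattern of catching specific exception types and printing what was skipped.

```python
import json
import sys


def load_json_record(raw, label="record"):
    """Parse one JSON record; report and skip it if it is malformed."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        # Narrow exception types plus a printed message: a bad record stays
        # visible in the run log instead of being silently dropped.
        print(f"[warn] skipping {label}: {type(exc).__name__}: {exc}", file=sys.stderr)
        return None
```

Anything outside the listed types (for example a KeyboardInterrupt or an unexpected bug) still propagates, which is the behavior a bare except Exception would hide.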