feat(benchmark): add Hermes OpenViking LoCoMo scripts #1985

Draft
ehz0ah wants to merge 1 commit into volcengine:main from ehz0ah:feat/hermes-openviking-benchmark

ehz0ah (Contributor) commented May 12, 2026

Summary

This adds a Hermes-backed LoCoMo benchmark runner under benchmark/locomo/hermes for comparing three memory paths:

  • Hermes native memory baseline
  • Hermes-to-OpenViking E2E ingestion
  • OpenViking pre-ingest queried through Hermes

The runner wires import, QA evaluation, judging, and final statistics into one repeatable flow while keeping the LoCoMo dataset and generated benchmark artifacts outside the PR.

Changes

  • Add run_full_eval.sh to orchestrate suite selection, result directories, retries, optional OpenViking checkpoints, and E2E target/archive readiness checks.
  • Add importers for native Hermes memory, Hermes/OpenViking E2E session ingestion, and direct OpenViking pre-ingest, using flattened LoCoMo session transcripts with timestamp and visual metadata.
  • Add shared QA, judge, and stats helpers with retry handling, tool-call accounting, Hermes state.db token/cache summaries, and OpenViking observer token deltas.
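The retry handling mentioned for the shared helpers might look roughly like the sketch below. This is not the PR's code: the name `call_with_retry`, its parameters, and the exponential backoff policy are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call fn, retrying transient failures with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            print(f"attempt {attempt}/{attempts} failed: {exc!r}; retrying in {delay:.1f}s")
            sleep(delay)
    raise ValueError("attempts must be >= 1")
```

Only the exception types listed in `retry_on` are retried, so programming errors still fail fast. The `sleep` parameter is injectable so the helper can be exercised without real delays.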

Notes

  • The LoCoMo dataset is intentionally not included. Use LOCOMO_JSON=/path/to/locomo10.json or place a local copy at the documented path.
  • Benchmark tests and generated result/checkpoint directories are intentionally not included.
  • state.db is used when available for authoritative Hermes token/cache accounting because gateway CSV token fields can be lossy.
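As a rough illustration of the dataset note above, a script could resolve the dataset location from `LOCOMO_JSON` and fall back to a local copy. The function name `resolve_locomo_path` is hypothetical. The fallback location is passed in by the caller because the documented path is not stated here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def resolve_locomo_path(
    default_path: Union[str, Path],
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Hypothetical helper: prefer LOCOMO_JSON, else the caller-supplied default."""
    env = os.environ if env is None else env
    # An empty or unset LOCOMO_JSON falls through to the default path.
    candidate = Path(env.get("LOCOMO_JSON") or default_path)
    if not candidate.is_file():
        raise FileNotFoundError(
            f"LoCoMo dataset not found at {candidate}; "
            "set LOCOMO_JSON=/path/to/locomo10.json"
        )
    return candidate
```

Failing early with a message that names the environment variable keeps a missing dataset from surfacing later as an opaque import error.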

Validation

  • uv run ruff format --check benchmark/locomo/hermes/*.py
  • uv run ruff check benchmark/locomo/hermes/*.py
  • bash -n benchmark/locomo/hermes/run_full_eval.sh
  • uv run python -m py_compile benchmark/locomo/hermes/*.py
  • ./benchmark/locomo/hermes/run_full_eval.sh --help
  • --help runs cleanly for the Python import, eval, judge, and stats scripts

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 85
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add LoCoMo import scripts

Relevant files:

  • benchmark/locomo/hermes/import_e2e.py
  • benchmark/locomo/hermes/import_to_ov.py
  • benchmark/locomo/hermes/import_to_native.py

Sub-PR theme: Add LoCoMo eval and judge scripts

Relevant files:

  • benchmark/locomo/hermes/eval.py
  • benchmark/locomo/hermes/judge.py

Sub-PR theme: Add LoCoMo stats and runner script

Relevant files:

  • benchmark/locomo/hermes/stat_judge_result.py
  • benchmark/locomo/hermes/run_full_eval.sh

⚡ Recommended focus areas for review

Error Handling (flagged at five locations)

Broad except Exception clauses without logging could hide real issues. Add logging (print is acceptable for benchmark scripts) or narrow the exception types.

Flagged snippets:

except Exception:
    return None

except Exception:
    continue

return None

resp = await client.get(f"{openviking_url}/api/v1/observer/models")
if resp.status_code != 200:

except Exception:
    return None
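One way to act on this suggestion is to catch only the failure that is expected and to print what was skipped. The sketch below is not code from the PR. It applies the idea to a made-up helper, `parse_json_line`, built on the standard `json` module.

```python
import json
import sys
from typing import Any, Optional


def parse_json_line(line: str, source: str = "<unknown>") -> Optional[Any]:
    """Hypothetical helper: return parsed JSON, or None for a malformed line."""
    try:
        return json.loads(line)
    except json.JSONDecodeError as exc:
        # Narrow catch: only malformed JSON is tolerated, and it is reported.
        print(f"[warn] {source}: skipping malformed JSON: {exc}", file=sys.stderr)
        return None
```

Unexpected errors, such as a `TypeError` from passing a non-string, still propagate instead of being silently turned into `None`.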

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

ehz0ah force-pushed the feat/hermes-openviking-benchmark branch from 172c5fa to 99abd11 on May 12, 2026 07:55