codex2parquet

A command-line tool to convert Codex session logs to Parquet format for data analysis and AI applications.

Installation

npm install -g codex2parquet

Usage

# Export Codex logs for current directory to codex_logs.parquet
codex2parquet

# Export logs from all projects
codex2parquet --all

# Export to custom filename
codex2parquet --output logs.parquet

# Export logs for a specific project directory
codex2parquet --project ~/code/myapp

# Read from a non-default Codex data directory
codex2parquet --codex-dir ~/.codex

What Gets Exported

Codex stores local data under ~/.codex by default. This tool reads:

  • ~/.codex/sessions/**/*.jsonl: current Codex rollout logs. Each line is a JSON object with timestamp, type, and payload.
  • ~/.codex/sessions/rollout-*.json: legacy rollout logs. Each file contains a session object and an items array.
  • ~/.codex/state_5.sqlite: thread metadata, including cwd, title, model, model provider, CLI version, sandbox policy, approval mode, token totals, git metadata, dynamic tools, and subagent parent/child edges.
  • ~/.codex/history.jsonl: prompt history rows with session_id, Unix timestamp, and text.
  • ~/.codex/logs_2.sqlite: diagnostic/runtime log rows when the current Node.js runtime includes node:sqlite.
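Each rollout JSONL line can be inspected with nothing beyond the standard library. A minimal sketch (the sample line below is illustrative, not a real log entry):

```python
import json

# One line of a current-format rollout log: a JSON object with
# timestamp, top-level type, and a nested payload.
line = (
    '{"timestamp": "2024-01-01T00:00:00Z", "type": "event_msg",'
    ' "payload": {"type": "agent_message", "message": "hi"}}'
)

event = json.loads(line)
top_level_type = event["type"]        # e.g. event_msg
event_type = event["payload"]["type"] # e.g. agent_message
```

These two fields map directly onto the `top_level_type` and `event_type` columns described under Output Schema below.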

The SQLite sources are optional. The exporter reads them through Node's native node:sqlite module and does not require a system sqlite3 command. If the SQLite files are missing or unreadable, the exporter still writes rollout and history rows.
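The exporter itself reads these databases through Node's `node:sqlite`, but any SQLite client works for ad-hoc inspection. A hedged sketch using Python's stdlib `sqlite3` that lists whatever tables `state_5.sqlite` contains, without assuming a particular schema:

```python
import sqlite3
from pathlib import Path

# Inspect the thread-metadata database, if present. Table names are
# not assumed here; we only query sqlite_master for what exists.
db_path = Path.home() / ".codex" / "state_5.sqlite"

if db_path.exists():
    con = sqlite3.connect(db_path)
    tables = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    print(tables)
    con.close()
```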

Output Schema

The generated Parquet file is an event table: one row per rollout event, legacy item, history prompt, or diagnostic log entry.

Important columns:

  • source_kind: rollout, history, or diagnostic_log
  • project: Project name derived from cwd
  • session_id: Codex thread/session identifier
  • item_index: Event index within its source
  • timestamp: ISO timestamp when available
  • rollout_path: Source rollout file path
  • top_level_type: Current JSONL top-level type, such as session_meta, event_msg, response_item, or turn_context
  • event_type: Nested event type for event_msg payloads
  • item_type: Response item type, such as message, reasoning, function_call, or function_call_output
  • role, name, status, call_id, item_id, turn_id: Common message and tool-call identifiers
  • text: The primary readable body for messages, user prompts, tool results, agent messages, and diagnostics
  • tool_input_json, tool_output: Tool/function call inputs and decoded outputs
  • model, model_provider, reasoning_effort, cwd, title, source, cli_version: Thread/session metadata
  • approval_mode, sandbox_policy, tokens_used, git_sha, git_branch, git_origin_url: Execution metadata from state_5.sqlite
  • input_tokens, cached_input_tokens, output_tokens, reasoning_output_tokens, total_tokens: Token usage when present in event payloads
  • rate_limits_json, metadata_json, content_json, payload_json, raw_json: Metadata and raw JSON preservation columns

All Parquet columns are written as strings to keep the schema stable across Codex log format changes. Rare or source-specific details, such as diagnostic log module paths, dynamic tools, and subagent metadata, are preserved in metadata_json instead of becoming mostly-empty top-level columns.
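The all-strings convention can be sketched as a small flattening step. This is an illustration of the idea, not the exporter's actual code: scalars become their string form, and nested objects are serialized as JSON so they survive schema drift:

```python
import json

def to_string_columns(event: dict) -> dict:
    """Coerce every value to a string (or None) so the Parquet
    schema stays stable even when Codex adds or changes fields."""
    out = {}
    for key, value in event.items():
        if value is None:
            out[key] = None
        elif isinstance(value, str):
            out[key] = value
        else:
            # Numbers, booleans, and nested objects/arrays are
            # serialized, matching the *_json preservation columns.
            out[key] = json.dumps(value)
    return out

row = to_string_columns({
    "session_id": "abc",
    "tokens_used": 1234,
    "metadata": {"git_branch": "main"},
})
```

With this scheme a reader never hits a type mismatch between files; consumers cast columns like `tokens_used` back to numbers at query time.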

Options

  • --output <file>, -o <file>: Output parquet filename (default: codex_logs.parquet)
  • --project <path>: Filter logs to a specific project directory
  • --all: Export logs from all Codex projects
  • --codex-dir <path>: Codex data directory (default: ~/.codex)
  • --no-history: Skip prompt history rows
  • --no-diagnostics: Skip diagnostic log rows
  • --help, -h: Show help message

Requirements

  • Node.js 22.5.0 or newer. SQLite enrichment uses native node:sqlite; no sqlite3 CLI is required.
  • Codex local data in ~/.codex

Use Cases

  • Analyzing Codex usage patterns across projects
  • Building datasets from human-agent coding sessions
  • Auditing tool calls, command outputs, and runtime diagnostics
  • Creating dashboards over models, projects, token usage, and git branches

Hyperparam

Hyperparam is a tool for exploring and curating AI datasets, such as those produced by codex2parquet.
