A command-line tool to convert Codex session logs to Parquet format for data analysis and AI applications.
```
npm install -g codex2parquet
```

```
# Export Codex logs for current directory to codex_logs.parquet
codex2parquet

# Export logs from all projects
codex2parquet --all

# Export to custom filename
codex2parquet --output logs.parquet

# Export logs for a specific project directory
codex2parquet --project ~/code/myapp

# Read from a non-default Codex data directory
codex2parquet --codex-dir ~/.codex
```

Codex stores local data under `~/.codex` by default. This tool reads:
- `~/.codex/sessions/**/*.jsonl`: current Codex rollout logs. Each line is a JSON object with `timestamp`, `type`, and `payload`.
- `~/.codex/sessions/rollout-*.json`: legacy rollout logs. Each file contains a `session` object and an `items` array.
- `~/.codex/state_5.sqlite`: thread metadata, including cwd, title, model, model provider, CLI version, sandbox policy, approval mode, token totals, git metadata, dynamic tools, and subagent parent/child edges.
- `~/.codex/history.jsonl`: prompt history rows with `session_id`, Unix timestamp, and text.
- `~/.codex/logs_2.sqlite`: diagnostic/runtime log rows when the current Node.js runtime includes `node:sqlite`.
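As a sketch of the current JSONL rollout format described above, each line can be decoded independently; the sample line here is hypothetical but follows the `timestamp`/`type`/`payload` shape:

```python
import json

# Hypothetical rollout line in the current JSONL shape: one JSON object
# per line with timestamp, type, and a nested payload.
line = (
    '{"timestamp": "2024-05-01T12:00:00Z", "type": "event_msg", '
    '"payload": {"type": "agent_message", "message": "done"}}'
)

event = json.loads(line)
print(event["type"])             # top-level type  → event_msg
print(event["payload"]["type"])  # nested event type → agent_message
```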
The SQLite sources are optional. The exporter reads them through Node's native `node:sqlite` module and does not require a system `sqlite3` command. If the SQLite files are missing or unreadable, the exporter still writes rollout and history rows.
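The graceful-degradation behavior can be sketched as a reader that returns nothing when the database is absent or broken. This is a minimal illustration, not the exporter's code, and the `threads` table and column names are assumptions; the real schema of `state_5.sqlite` may differ:

```python
import sqlite3
from pathlib import Path

def try_read_threads(db_path: Path) -> list[tuple]:
    """Return thread metadata rows, or [] if the DB is missing or unreadable."""
    if not db_path.exists():
        return []
    try:
        con = sqlite3.connect(db_path)
        # "threads" and these column names are illustrative assumptions.
        rows = con.execute("SELECT id, cwd, model FROM threads").fetchall()
        con.close()
        return rows
    except sqlite3.Error:
        return []  # enrichment is optional: fall back to rollout/history rows

# A missing file degrades gracefully instead of raising:
rows = try_read_threads(Path("/nonexistent/state_5.sqlite"))
print(rows)  # → []
```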
The generated Parquet file is an event table. It includes one row per rollout event, legacy item, history prompt, or diagnostic log entry.
Important columns:
- `source_kind`: `rollout`, `history`, or `diagnostic_log`
- `project`: Project name derived from `cwd`
- `session_id`: Codex thread/session identifier
- `item_index`: Event index within its source
- `timestamp`: ISO timestamp when available
- `rollout_path`: Source rollout file path
- `top_level_type`: Current JSONL top-level type, such as `session_meta`, `event_msg`, `response_item`, or `turn_context`
- `event_type`: Nested event type for `event_msg` payloads
- `item_type`: Response item type, such as `message`, `reasoning`, `function_call`, or `function_call_output`
- `role`, `name`, `status`, `call_id`, `item_id`, `turn_id`: Common message and tool-call identifiers
- `text`: The primary readable body for messages, user prompts, tool results, agent messages, and diagnostics
- `tool_input_json`, `tool_output`: Tool/function call inputs and decoded outputs
- `model`, `model_provider`, `reasoning_effort`, `cwd`, `title`, `source`, `cli_version`: Thread/session metadata
- `approval_mode`, `sandbox_policy`, `tokens_used`, `git_sha`, `git_branch`, `git_origin_url`: Execution metadata from `state_5.sqlite`
- `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens`, `total_tokens`: Token usage when present in event payloads
- `rate_limits_json`, `metadata_json`, `content_json`, `payload_json`, `raw_json`: Metadata and raw JSON preservation columns
All Parquet columns are written as strings to keep the schema stable across Codex log format changes. Rare or source-specific details, such as diagnostic log module paths, dynamic tools, and subagent metadata, are preserved in `metadata_json` instead of becoming mostly-empty top-level columns.
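Because every column is a string, consumers need an explicit decode step for numeric and JSON-bearing columns. A minimal sketch, where the row dict is a hypothetical exported row rather than real output:

```python
import json

# Hypothetical exported row: every value is a string, and rare details
# live inside metadata_json rather than in top-level columns.
row = {
    "source_kind": "diagnostic_log",
    "total_tokens": "1234",
    "metadata_json": '{"module_path": "codex_core::exec"}',
}

metadata = json.loads(row["metadata_json"]) if row["metadata_json"] else {}
total_tokens = int(row["total_tokens"]) if row["total_tokens"] else None
print(metadata.get("module_path"), total_tokens)  # → codex_core::exec 1234
```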
- `--output <file>`, `-o <file>`: Output parquet filename (default: `codex_logs.parquet`)
- `--project <path>`: Filter logs to a specific project directory
- `--all`: Export logs from all Codex projects
- `--codex-dir <path>`: Codex data directory (default: `~/.codex`)
- `--no-history`: Skip prompt history rows
- `--no-diagnostics`: Skip diagnostic log rows
- `--help`, `-h`: Show help message
- Node.js 22.5.0 or newer. SQLite enrichment uses the native `node:sqlite` module; no `sqlite3` CLI is required.
- Codex local data in `~/.codex`
- Analyzing Codex usage patterns across projects
- Building datasets from human-agent coding sessions
- Auditing tool calls, command outputs, and runtime diagnostics
- Creating dashboards over models, projects, token usage, and git branches
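As a sketch of the dashboard-style analyses above, token usage can be aggregated per project once the event table is loaded (in practice via a Parquet reader such as pandas or DuckDB); the rows below are hypothetical stand-ins for a loaded `codex_logs.parquet`:

```python
from collections import defaultdict

# Hypothetical rows standing in for a loaded codex_logs.parquet table.
# Values are strings, matching the exporter's all-string schema.
rows = [
    {"project": "myapp", "total_tokens": "1200"},
    {"project": "myapp", "total_tokens": ""},     # token counts not present
    {"project": "docs",  "total_tokens": "300"},
]

usage = defaultdict(int)
for row in rows:
    if row["total_tokens"]:  # empty string means "not present in this event"
        usage[row["project"]] += int(row["total_tokens"])

print(dict(usage))  # → {'myapp': 1200, 'docs': 300}
```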
Hyperparam is a tool for exploring and curating AI datasets, such as those produced by codex2parquet.