GitHub Data Model — Markdown Format

Sync a GitHub repository's issues, pull requests, releases, and metadata into a local github-data/ folder. The result is a plain-text archive that an agent can read, grep, and reason about without API calls.

Directory Structure

github-data/
  repo.yml                   # repository metadata + sync state
  labels.yml                 # all repository labels
  milestones.yml             # all milestones
  issues/
    0001.md                  # issue or PR — one file per number
    0002.md
    0042.md
    0043.md
  projects/                  # Projects v2 boards linked to this repo
    0001.md                  # project file — one per open project
  discussions/               # GitHub Discussions
    0007.md                  # one file per discussion
  releases/
    v1.0.0.md
    v1.2.0.md
  events/                    # event files exported since last sync (for agents to pick up)
    20240916-140000-000-issue_closed-42.md

Issues and pull requests share a single number space (as on GitHub). The filename is the zero-padded number. A file is self-contained: open it and you see the full thread.

repo.yml

Repository-level metadata and the sync cursor. The feature flags (has_issues, has_projects, has_wiki, has_pages, has_discussions) and archived are always written, even when false, so agents can grep for them; other fields are omitted when empty.

owner: acme
repo: widgets
default_branch: main
description: A widget catalog
homepage: https://widgets.example
visibility: public
language: Go
license: MIT License
topics:
  - cli
  - golang
archived: false
has_issues: true
has_projects: true
has_wiki: true
has_pages: false
has_discussions: false
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
pushed_at: 2024-09-17T07:55:00Z
synced_at: 2024-09-17T08:00:00Z
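Because the flags are always present, they can be read with a plain line scan and no YAML library. The following is a minimal sketch that assumes the flat `key: value` layout shown above; `read_flags` is a hypothetical helper name, not part of the exporter.

```python
def read_flags(repo_yml_text):
    """Return archived and the has_* flags from repo.yml text as booleans."""
    flags = {}
    for line in repo_yml_text.splitlines():
        # Skip indented lines (list items) and lines without a key.
        if line[:1].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "archived" or key.startswith("has_"):
            flags[key] = value.strip() == "true"
    return flags
```

Since the exporter writes these keys even when false, a missing key in the result suggests a truncated or hand-edited file rather than a disabled feature.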

labels.yml

- name: bug
  color: d73a4a
  description: Something isn't working

- name: enhancement
  color: a2eeef
  description: New feature or request

- name: priority/high
  color: b60205

milestones.yml

- title: v2.1
  state: closed
  description: Stability release
  due_on: 2024-10-01
  closed_at: 2024-09-28

- title: v3.0
  state: open
  description: Major redesign
  due_on: 2025-03-01

Issue File

YAML frontmatter holds structured metadata. The markdown body is the issue text. Comments, reviews, and events follow as additional YAML documents separated by ---.

---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - hubot
labels:
  - bug
  - priority/high
milestone: v2.1
---

When passing an empty string to `parse()`, the application crashes with a null
pointer exception.

## Steps to reproduce

1. Call `parse("")`
2. Observe crash

Pull Request File

A pull request uses the same format as an issue. The type: pull_request field and the PR-only frontmatter fields distinguish it from an issue.

---
number: 43
title: Handle empty input in parser
type: pull_request
state: closed
created_at: 2024-09-15T12:00:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - octocat
labels:
  - bugfix
milestone: v2.1
source_branch: fix/empty-input
target_branch: main
merge:
  merged: true
  merged_at: 2024-09-16T14:00:00Z
  merged_by: hubot
  commit_sha: abc123f
reviewers:
  - hubot
requested_reviewers:
  - monalisa
---

Fixes #42. Adds a guard clause to `parse()` to return early on empty input.

Subsequent Documents

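After the first document, each further frontmatter block (a pair of --- lines) starts a new document, and the text up to the next block is that document's body. The document field declares its type. All documents appear in chronological order.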
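The layout described here can be split with a small state machine that toggles on each `---` line. This is a sketch under two assumptions that the format does not spell out: a body never contains a line consisting only of `---`, and the caller needs only top-level scalar fields (a real reader would hand each frontmatter block to a YAML library). The names `split_documents` and `top_level_scalars` are hypothetical.

```python
def split_documents(text):
    """Split a file into (frontmatter_lines, body) pairs, in file order."""
    docs = []
    front, body = [], []
    in_frontmatter = False
    started = False
    for line in text.splitlines():
        if line.rstrip() == "---":
            if in_frontmatter:
                in_frontmatter = False      # closing delimiter: body follows
            else:
                if started:                 # opening delimiter ends the previous doc
                    docs.append((front, "\n".join(body).strip()))
                front, body = [], []
                in_frontmatter = True
                started = True
            continue
        if in_frontmatter:
            front.append(line)
        elif started:
            body.append(line)
    if started:
        docs.append((front, "\n".join(body).strip()))
    return docs


def top_level_scalars(front_lines):
    """Read unindented `key: value` lines; lists and nested maps are skipped."""
    fields = {}
    for line in front_lines:
        if line.startswith((" ", "-", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = value.strip()   # values stay strings
    return fields
```

The first pair returned is the issue or PR itself; every later pair carries a `document` field naming its type.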

comment

---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---

I can reproduce this. The guard clause was removed in the last refactor.

review

---
document: review
id: 200
author: hubot
state: approved
commit_sha: abc123f
submitted_at: 2024-09-16T10:00:00Z
---

Looks good. The early return is clean.

review_comment

Inline code comment tied to a file, line, and review.

---
document: review_comment
id: 201
review_id: 200
author: hubot
created_at: 2024-09-16T10:00:00Z
path: src/parser.js
line: 12
side: RIGHT
commit_sha: abc123f
---

Nit: could use `=== undefined` instead of `== null` for clarity.

event

A state change on the issue or PR. Events usually have no body.

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---

Common event types: labeled, unlabeled, assigned, unassigned, closed, reopened, merged, renamed, milestoned, demilestoned, referenced, cross-referenced, review_requested, review_request_removed, review_dismissed, head_ref_force_pushed, head_ref_deleted, base_ref_changed, converted_to_draft, ready_for_review, locked, unlocked, pinned, unpinned, transferred, connected, disconnected, marked_as_duplicate, unmarked_as_duplicate.

Event-specific fields are added flat in frontmatter:

| Event type | Extra fields |
| --- | --- |
| labeled / unlabeled | label |
| assigned / unassigned | assignee |
| milestoned / demilestoned | milestone |
| renamed | from, to |
| closed / merged / referenced | commit_sha |
| cross-referenced | source_number, source_repo |
| review_requested / review_request_removed | reviewer |
| locked | lock_reason |
| review_dismissed | dismissal_message |
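Because the extra fields sit flat in the frontmatter, the table above maps directly onto a lookup. The sketch below turns an event's fields (already parsed into a dict of strings) into a one-line summary. `EXTRA_FIELDS` and `describe_event` are hypothetical names, and treating an extra field as optional is an assumption, since the format does not say whether, for example, a closed event always carries a commit_sha.

```python
# Extra fields per event type, copied from the table above.
EXTRA_FIELDS = {
    "labeled": ["label"], "unlabeled": ["label"],
    "assigned": ["assignee"], "unassigned": ["assignee"],
    "milestoned": ["milestone"], "demilestoned": ["milestone"],
    "renamed": ["from", "to"],
    "closed": ["commit_sha"], "merged": ["commit_sha"], "referenced": ["commit_sha"],
    "cross-referenced": ["source_number", "source_repo"],
    "review_requested": ["reviewer"], "review_request_removed": ["reviewer"],
    "locked": ["lock_reason"],
    "review_dismissed": ["dismissal_message"],
}


def describe_event(fields):
    """Render an event's frontmatter fields as a single summary line."""
    parts = [fields["created_at"], fields["actor"], fields["event"]]
    for name in EXTRA_FIELDS.get(fields["event"], []):
        if name in fields:          # assumption: extra fields may be absent
            parts.append(f"{name}={fields[name]}")
    return " ".join(parts)
```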

Project File

One file per open Projects v2 board linked to the repository, named by the project number (projects/0001.md). The frontmatter holds the project header and field definitions; the body is the project's readme; each linked item follows as an item sub-document with its current field values.

Closed projects are not written — when a project transitions from open to closed, its file is deleted and a project_closed event is emitted. Draft issues (project-only items without an issue number) are skipped.

---
number: 1
title: Q1 Roadmap
state: open
public: true
url: https://github.com/orgs/acme/projects/1
owner: acme
description: Quarterly planning
created_at: 2024-01-01T00:00:00Z
updated_at: 2024-09-15T08:00:00Z
fields:
  - name: Status
    type: SINGLE_SELECT
    options:
      - Todo
      - In Progress
      - Done
  - name: Priority
    type: SINGLE_SELECT
    options:
      - P0
      - P1
  - name: Iteration
    type: ITERATION
---

Long-form project description / readme.

---
document: item
type: issue
number: 42
title: Fix crash on empty input
repo: acme/widgets
fields:
  Priority: P0
  Status: In Progress
---

---
document: item
type: pull_request
number: 43
title: Handle empty input in parser
repo: acme/widgets
fields:
  Status: Done
---

Discussion File

One file per GitHub Discussion at discussions/<number>.md. Discussions share the repository's number space with issues and PRs (so a repo can have an issue #42 or a discussion #42, never both). YAML frontmatter holds the metadata; top-level replies are emitted as document: comment and nested replies as document: reply with a parent_id pointing at the comment they reply to.

Discussions are available only through the GraphQL API. They use the same since cutoff as the rest of the exporter: the list is fetched newest-first, and pagination stops as soon as items are older than the cutoff.
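The early-stop pagination can be sketched as follows. This is an illustration, not the exporter's code: it assumes each page is a list of dicts, that the list is ordered by updated_at, and that pages arrive from a lazy iterable so that stopping early avoids further requests. `parse_ts` and `collect_since` are hypothetical names.

```python
from datetime import datetime


def parse_ts(value):
    """Parse a timestamp such as 2024-09-16T14:00:00Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def collect_since(pages, cutoff):
    """Collect items from newest-first pages until one is older than cutoff."""
    kept = []
    for page in pages:
        for item in page:
            if parse_ts(item["updated_at"]) < cutoff:
                return kept     # the rest is older still, so stop paging
            kept.append(item)
    return kept
```

When `pages` is a generator that issues one request per page, returning early means the remaining pages are never requested.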

---
number: 7
title: How do I export the wiki?
type: discussion
state: open
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
author: octocat
category: Q&A
labels:
  - question
answer_id: 17024104
answer_chosen_at: 2024-09-16T14:00:00Z
answer_chosen_by: hubot
---

Discussion body markdown.

---
document: comment
id: 17024104
author: hubot
created_at: 2024-09-15T11:00:00Z
is_answer: true
---

Top-level reply that was marked as the chosen answer.

---
document: reply
id: 17024105
parent_id: 17024104
author: octocat
created_at: 2024-09-15T11:30:00Z
---

Nested reply under the top-level comment.

Discussion frontmatter

| Field | Type | Notes |
| --- | --- | --- |
| number | integer | Required. Unique within the repo (shared with issues/PRs) |
| title | string | Required |
| type | string | Always discussion |
| state | string | open or closed |
| state_reason | string | outdated, duplicate, resolved, reopened (lowercased) |
| locked | boolean | Omit if false |
| created_at | ISO-8601 | Required |
| updated_at | ISO-8601 | Required |
| closed_at | ISO-8601 | Present when closed |
| author | string | GitHub username |
| category | string | Discussion category name (e.g. Q&A, General, Ideas) |
| labels | string list | Label names |
| answer_id | integer | Q&A only: databaseId of the comment marked as answer |
| answer_chosen_at | ISO-8601 | Q&A only |
| answer_chosen_by | string | Q&A only: username who marked the answer |

Discussion sub-documents

| document | Fields |
| --- | --- |
| comment | id, author, created_at, optional is_answer: true |
| reply | id, parent_id, author, created_at |
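The parent_id link is enough to rebuild the two-level thread. A minimal sketch, assuming the sub-documents have already been parsed into dicts with integer ids; `build_threads` is a hypothetical helper name.

```python
def build_threads(sub_documents):
    """Group reply documents under the comment named by their parent_id."""
    threads = {}    # comment id -> {"comment": doc, "replies": [docs]}
    for doc in sub_documents:
        if doc["document"] == "comment":
            threads[doc["id"]] = {"comment": doc, "replies": []}
    for doc in sub_documents:
        # Replies whose parent is missing (for example, truncated) are skipped.
        if doc["document"] == "reply" and doc["parent_id"] in threads:
            threads[doc["parent_id"]]["replies"].append(doc)
    return list(threads.values())
```

Dicts preserve insertion order, so comments and their replies come back in the same order as in the file.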

If a discussion has more than 100 top-level comments, or any comment has more than 50 replies, the export keeps only the first N entries and logs a warning (Warning: discussion #N has more than 100 top-level comments — only first 100 exported). This is a deliberate trade-off to keep GraphQL node cost bounded.

Cross-references

When an issue or PR is on one or more projects, the issue file's frontmatter also lists them:

projects:
  - Q1 Roadmap
  - Bugs

This is populated on the next sync that re-fetches the issue (an issue gets re-fetched when its updated_at advances, which happens whenever it is added to or removed from a project).

Release File

---
tag: v1.0.0
name: Version 1.0.0
draft: false
prerelease: false
author: octocat
created_at: 2024-06-01T12:00:00Z
published_at: 2024-06-01T12:00:00Z
target_commitish: main
assets:
  - name: app-v1.0.0-linux-amd64.tar.gz
    content_type: application/gzip
    size_bytes: 12345678
    download_count: 542
---

## What's New

- Initial stable release
- Full parser support
- CLI interface

Complete Example: issues/0042.md

A full issue file showing the chronological thread.

---
number: 42
title: Fix crash on empty input
state: closed
state_reason: completed
created_at: 2024-09-15T10:30:00Z
updated_at: 2024-09-16T14:00:00Z
closed_at: 2024-09-16T14:00:00Z
author: octocat
assignees:
  - hubot
labels:
  - bug
  - priority/high
milestone: v2.1
---

When passing an empty string to `parse()`, the application crashes with a null
pointer exception.

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: bug
---

---
document: event
event: labeled
actor: octocat
created_at: 2024-09-15T10:31:00Z
label: priority/high
---

---
document: comment
id: 100
author: hubot
created_at: 2024-09-15T11:00:00Z
---

I can reproduce this. The guard clause was removed in the last refactor.

---
document: event
event: assigned
actor: octocat
created_at: 2024-09-15T11:05:00Z
assignee: hubot
---

---
document: comment
id: 101
author: octocat
created_at: 2024-09-15T14:00:00Z
---

Fixed in PR #43.

---
document: event
event: closed
actor: hubot
created_at: 2024-09-16T14:00:00Z
commit_sha: abc123f
---

Frontmatter Reference

Issue / Pull Request (first document)

| Field | Type | Notes |
| --- | --- | --- |
| number | integer | Required. Unique within repo |
| title | string | Required |
| type | string | pull_request if PR, omit for issues |
| state | string | open or closed |
| state_reason | string | completed, not_planned, reopened |
| locked | boolean | Omit if false |
| created_at | ISO-8601 | Required |
| updated_at | ISO-8601 | Required |
| closed_at | ISO-8601 | Present when closed |
| author | string | GitHub username |
| assignees | string list | Usernames |
| labels | string list | Label names |
| milestone | string | Milestone title |
| projects | string list | Projects v2 boards the item is on |
| reactions | map | e.g. {"+1": 2, "heart": 1}; omit if none |

PR-only fields (when type: pull_request)

| Field | Type | Notes |
| --- | --- | --- |
| draft | boolean | Omit if false |
| source_branch | string | |
| target_branch | string | |
| source_repo | string | Only for cross-repo PRs |
| merge.merged | boolean | |
| merge.merged_at | ISO-8601 | |
| merge.merged_by | string | Username |
| merge.commit_sha | string | |
| reviewers | string list | Completed reviewers |
| requested_reviewers | string list | Pending reviewers |

Subsequent documents

| Field | Type | Notes |
| --- | --- | --- |
| document | string | Required. comment, review, review_comment, event |
| id | integer | Required for comments and reviews |
| author | string | For comments/reviews |
| actor | string | For events |
| created_at | ISO-8601 | Required |

Type-specific fields are added flat — see examples above.

Agent Usage

This format is designed so an agent with standard file tools (read, glob, grep) can work with GitHub data without API access.

Find open bugs:

grep -l "^state: open" github-data/issues/*.md | xargs grep -lE "^ *- bug$"

Read a specific issue thread:

cat github-data/issues/0042.md

Find issues mentioning a file:

grep -rl "parser.js" github-data/issues/

Find all PRs merged to main:

grep -l "target_branch: main" github-data/issues/*.md | xargs grep -l "merged: true"

List releases:

ls github-data/releases/

Check sync freshness:

cat github-data/repo.yml
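A grep for a label name can also match text in an issue body or a comment. Where that matters, a frontmatter-aware check is more precise. The sketch below reads only the first frontmatter block and assumes labels are written as an indented `- name` list, as in the examples; `first_frontmatter` and `is_open_with_label` are hypothetical helper names.

```python
def first_frontmatter(text):
    """Return the lines between the first pair of --- delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].rstrip() != "---":
        return []
    block = []
    for line in lines[1:]:
        if line.rstrip() == "---":
            break
        block.append(line)
    return block


def is_open_with_label(text, label):
    """True if the first document has state: open and lists the given label."""
    state = None
    labels = []
    in_labels = False
    for line in first_frontmatter(text):
        if line.startswith("state:"):
            state = line.partition(":")[2].strip()
        if line.startswith("labels:"):
            in_labels = True
            continue
        if in_labels and line.lstrip().startswith("- "):
            labels.append(line.lstrip()[2:].strip())
        else:
            in_labels = False   # any other line ends the labels list
    return state == "open" and label in labels
```

To scan the whole archive, call `is_open_with_label(path.read_text(encoding="utf-8"), "bug")` for each path from `pathlib.Path("github-data/issues").glob("*.md")`.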

Sync Behavior

  • Full sync: Uses bulk API endpoints (repo-wide comments, events, PRs, review comments) to fetch all data in a few paginated requests instead of per-issue calls. Only PR reviews require per-PR fetches (no bulk endpoint).
  • Incremental sync: Reads synced_at from repo.yml and fetches only items updated since then, via the since parameter. For each changed issue it uses the per-issue timeline endpoint (which gives the complete history in one call), plus the bulk PR list.
  • Deleted items: GitHub doesn't hard-delete issues. Transferred or spam-deleted issues are left as-is (the state and timeline tell the story).
  • File naming: Zero-padded to 4 digits (0042.md). Repos with >9999 issues use 5+ digits.
  • Idempotent: Running sync twice produces the same files. Safe to re-run.
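The file naming rule above corresponds to a minimum-width format rather than a fixed width. A small sketch, with hypothetical helper names `issue_filename` and `issue_number`:

```python
def issue_filename(number):
    """Zero-pad to at least four digits; longer numbers keep all their digits."""
    return f"{number:04d}.md"


def issue_number(filename):
    """Recover the issue number from a file name such as 0042.md."""
    return int(filename.rsplit(".", 1)[0])
```

Once a repository passes 9999 issues, file names no longer sort correctly as plain strings, so sorting by `issue_number` is the safer choice.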

Design Decisions

Why github-data/ inside the repo? The agent already has the repo checked out. Colocating the data means no extra paths to configure. Add github-data/ to .gitignore if you don't want it committed.

Why one file per issue? An agent can read a single file to get the full picture. Grep works across all issues. No database, no joins, no query language.

Why multi-document markdown? The thread reads top-to-bottom like a conversation. Frontmatter is parseable; the body is readable. Each frontmatter block is plain YAML, so a standard YAML parser can read it once the file is split on its --- delimiters.

Why usernames instead of user objects? Keeps files readable and greppable. A username is enough to identify who did what. Full user profiles (email, avatar) are rarely needed for reasoning.

Why flat event fields? label: bug is simpler than label: { name: bug, color: d73a4a }. The label details live in labels.yml if you need them.

Why chronological order? Events and comments interleaved in time order tell the story of what happened. An agent can read top-to-bottom without sorting.