
feat: add offline OSV-to-catalog generator (tools/osvcatalog)#45

Open
adel-pplx wants to merge 4 commits into main from osv-catalog-import

adel-pplx commented May 29, 2026

Adds tools/osvcatalog, an offline generator that converts a locally-downloaded OSV snapshot into a Bumblebee exposure catalog. The scanner never contacts osv.dev at scan time; this is a maintainer-side import that produces a static catalog reviewed and committed like any other under threat_intel/.

Only malicious-package records (MAL- ids, or records aliased to one) are emitted, each with severity: "critical". OSV ecosystems (npm, PyPI, Go, RubyGems, Packagist, VSCode) map to Bumblebee's equivalents (VSCode → editor-extension), and entries are built from OSV's enumerated affected[].versions. Records whose only version information is a range (no enumerated versions) are skipped, since v0.1 matches exact versions only; this is documented as the main coverage limit.
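
For reviewers, the filtering and mapping rules above can be sketched roughly as follows. This is an illustration only, not the code in internal/osv: the `record`, `affected`, and `entry` types, the `convert` function, and the Bumblebee name "go" are assumptions made for the example, and the record ids in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, trimmed-down OSV record; the real types may differ.
type affected struct {
	Ecosystem string
	Name      string
	Versions  []string // OSV's enumerated affected[].versions
}

type record struct {
	ID       string
	Aliases  []string
	Affected []affected
}

// entry is an assumed shape for one catalog entry.
type entry struct {
	Ecosystem, Name, Version, Severity string
}

// ecosystemMap follows the mapping described in this PR.
// The Bumblebee-side name "go" is an assumption.
var ecosystemMap = map[string]string{
	"npm": "npm", "PyPI": "pypi", "Go": "go",
	"RubyGems": "rubygems", "Packagist": "packagist",
	"VSCode": "editor-extension",
}

// isMalicious reports whether the record has a MAL- id or alias.
func isMalicious(r record) bool {
	if strings.HasPrefix(r.ID, "MAL-") {
		return true
	}
	for _, a := range r.Aliases {
		if strings.HasPrefix(a, "MAL-") {
			return true
		}
	}
	return false
}

// convert returns one entry per enumerated version, or a skip reason.
func convert(r record) ([]entry, string) {
	if !isMalicious(r) {
		return nil, "not-malicious"
	}
	var out []entry
	sawSupported := false
	for _, a := range r.Affected {
		eco, ok := ecosystemMap[a.Ecosystem]
		if !ok {
			continue
		}
		sawSupported = true
		for _, v := range a.Versions {
			out = append(out, entry{Ecosystem: eco, Name: a.Name, Version: v, Severity: "critical"})
		}
	}
	switch {
	case len(out) > 0:
		return out, ""
	case !sawSupported:
		return nil, "unsupported-ecosystem"
	default:
		return nil, "no-versions" // range-only record
	}
}

func main() {
	r := record{ID: "MAL-0000-0001", Affected: []affected{
		{Ecosystem: "VSCode", Name: "evil.ext", Versions: []string{"1.0.0"}},
	}}
	fmt.Println(convert(r))
}
```

A record that carries only ranges for a supported ecosystem falls through to the "no-versions" branch, which is the coverage limit described above.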

Two input shapes are supported and documented in threat_intel/README.md:

  • the OSSF malicious-packages repo via --sparse clone (recommended; all ecosystems in one tree);
  • the per-ecosystem OSV all.zip archives.

Conversion logic lives in internal/osv; the CLI under tools/ is thin. Output is deterministic, validates against the published exposure-catalog schema, and is consumed by bumblebee scan --exposure-catalog. Zero new dependencies.

Verified on the full OSSF corpus (226,427 records → 22,365 entries; byte-identical across reruns).

Closes #21

Converts a locally-downloaded OSV snapshot into a Bumblebee exposure
catalog. The scanner never contacts osv.dev; this is an offline,
maintainer-side import that produces a static catalog reviewed and
committed like any other under threat_intel/.

By default only malicious-package records (MAL- ids) are emitted;
-include-vulns widens to all OSV records. OSV ecosystems (npm, PyPI, Go,
RubyGems, Packagist) map to Bumblebee's, using OSV's enumerated
affected[].versions. Range-only entries (no enumerated versions) are
skipped, since v0.1 matches exact versions only.

Closes #21
@adel-pplx adel-pplx requested a review from kyle-pplx May 29, 2026 20:12
adel-pplx added 3 commits June 1, 2026 21:56
…, scope

Local hardening pass on tools/osvcatalog and internal/osv:

- M1: defensive post-read length check in loadZip, mirroring the
  existing guard on plain .json reads (belt-and-suspenders against
  malformed archives; archive/zip normally errors first).
- M2: map OSV "VSCode" ecosystem to Bumblebee "editor-extension"
  (recovers 17 records from the OSSF malicious-packages corpus that
  were previously skipped under SkippedEcosystem).
- M3: hard-code severity="critical" on every emitted entry so the
  generated catalog matches the dominant severity used by hand-curated
  malicious-package catalogs under threat_intel/.
- L1: id-shape regex guard before embedding the OSV id in the Source
  URL; rejects whitespace, control characters, ?, #, %, @, and other
  out-of-set bytes (host is always osv.dev regardless).
- L3: skip-reason counts (no-versions, unsupported-ecosystem, withdrawn,
  not-malicious, bad-id) folded into the deterministic _comment for
  reproducibility audits.
- L5: optional -source flag stamps an upstream provenance label
  (e.g. "github.com/ossf/malicious-packages@<sha>") into _comment.
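
A sketch of the L1 id-shape guard, under assumptions: the exact character set, the length cap, the `sourceURL` helper name, and the osv.dev URL path are illustrative and may not match internal/osv.

```go
package main

import (
	"fmt"
	"regexp"
)

// idShape is an assumed allow-list: ASCII letters, digits, '.', '_'
// and '-', starting with a letter or digit, at most 128 bytes.
var idShape = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]{0,127}$`)

// sourceURL embeds the id in a fixed-host URL only if the id passes
// the shape check; otherwise the record would be counted as bad-id.
func sourceURL(id string) (string, bool) {
	if !idShape.MatchString(id) {
		return "", false
	}
	return "https://osv.dev/vulnerability/" + id, true
}

func main() {
	// Ids below are made up for the example.
	for _, id := range []string{"MAL-0000-0001", "MAL-1?x=1", "MAL 1", "MAL-1@evil", "MAL-1%2f"} {
		u, ok := sourceURL(id)
		fmt.Printf("%q ok=%v url=%q\n", id, ok, u)
	}
}
```

An allow-list keeps the check independent of URL-escaping rules: anything outside the set is rejected rather than encoded.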

Scope tightening: dropped -include-vulns flag and Options.IncludeVulns;
the importer emits malicious-package entries only, in line with the
catalog format's purpose.

Robustness: readLimited helper treats -max-file-size <= 0 as
"unbounded" (matching internal/exposure.LoadFile semantics) instead of
silently truncating to 1 byte and dropping every record.

threat_intel/README.md: rewritten to document both supported input
shapes side by side - the OSSF malicious-packages sparse-checkout
(recommended, all ecosystems) and the per-ecosystem OSV all.zip
alternative.

Tests added/updated:
- TestLoadZipPostReadSizeGuard renamed to
  TestLoadZipRejectsOversizedEntry with an honest comment about which
  guard it exercises.
- TestMaxFileSizeZeroIsUnbounded covers the new unbounded semantics.
- TestVSCodeMapsToEditorExtension, TestSkipBadIDShape,
  TestSeverityCritical, TestConvertDropsNonMaliciousVuln,
  TestBuildCatalogCommentDeterministic (extended), TestRunSourceFlag.

Verified on the full OSSF corpus (226,427 records → 22,365 entries;
deterministic across reruns; npm 16,505 / pypi 4,873 / rubygems 969 /
editor-extension 17 / packagist 1). Generated catalog and all existing
threat_intel/ catalogs validate against
docs/schema/v0.1.0/exposure-catalog.schema.json. End-to-end scanner
test against a planted malicious package emits a package_exposure
finding with severity:"critical" and a valid source URL.

Closes #21
…ame skip

- threat_intel/README.md: "roughly half" was wrong by ~9x. The
  range-only path drops the large majority (~90% of the current OSSF
  corpus); state the right number.
- internal/osv/osv.go: comment the empty-package-name fallthrough so
  it's not the one drop path that's silently uncounted in Stats.
- tools/osvcatalog/main_test.go: drop a stale review-milestone label
  from a test comment.
If Convert ever returned zero entries, the prior t.Fatalf would panic
with index-out-of-range instead of failing cleanly. Print the whole
entries slice (matching sibling tests in this file).
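
The failure mode is that arguments to t.Fatalf are evaluated before the call, so indexing an empty slice in the argument list panics before the test can report anything. A sketch of the safer pattern, with assumed names: `fataler`, `recorder`, `requireSingleCritical`, and the `entry` type are hypothetical stand-ins so the example runs outside `go test`.

```go
package main

import "fmt"

// fataler is the subset of *testing.T this sketch needs.
type fataler interface {
	Fatalf(format string, args ...any)
}

// entry is an assumed shape for a catalog entry.
type entry struct {
	Name     string
	Severity string
}

// requireSingleCritical checks the length before indexing and prints
// the whole slice on failure, so an empty result fails cleanly.
func requireSingleCritical(t fataler, entries []entry) {
	if len(entries) != 1 {
		t.Fatalf("want exactly 1 entry, got %d: %+v", len(entries), entries)
		return // a real t.Fatalf does not return; the fake below does
	}
	if got := entries[0].Severity; got != "critical" {
		t.Fatalf("severity = %q, want critical; entries: %+v", got, entries)
	}
}

// recorder is a fake that collects failure messages.
type recorder struct{ msgs []string }

func (r *recorder) Fatalf(format string, args ...any) {
	r.msgs = append(r.msgs, fmt.Sprintf(format, args...))
}

func main() {
	rec := &recorder{}
	requireSingleCritical(rec, nil) // zero entries: reported, no panic
	requireSingleCritical(rec, []entry{{Name: "evil", Severity: "critical"}})
	fmt.Println(len(rec.msgs), "failure(s) recorded")
}
```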

Development

Successfully merging this pull request may close these issues.

Add osv.dev as package source
