feat: add offline OSV-to-catalog generator (tools/osvcatalog) #45
Open
adel-pplx wants to merge 4 commits into
Converts a locally-downloaded OSV snapshot into a Bumblebee exposure catalog. The scanner never contacts osv.dev; this is an offline, maintainer-side import that produces a static catalog reviewed and committed like any other under threat_intel/. By default only malicious-package records (MAL- ids) are emitted; -include-vulns widens to all OSV records. OSV ecosystems (npm, PyPI, Go, RubyGems, Packagist) map to Bumblebee's, using OSV's enumerated affected[].versions. Range-only entries (no enumerated versions) are skipped, since v0.1 matches exact versions only. Closes #21
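The default filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/osv: the `Record`, `Affected`, and `Entry` types, the `convert` function, the record ids, and the lowercase target name for Go are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Affected is a trimmed-down, hypothetical view of one OSV affected[] entry.
type Affected struct {
	Ecosystem string
	Name      string
	Versions  []string // enumerated versions; empty when only ranges are given
}

// Record is a trimmed-down, hypothetical view of one OSV record.
type Record struct {
	ID       string
	Affected []Affected
}

// Entry is an illustrative catalog row, not the real Bumblebee schema.
type Entry struct {
	ID, Ecosystem, Name, Version string
}

// ecosystems maps OSV ecosystem names to assumed Bumblebee names.
var ecosystems = map[string]string{
	"npm":       "npm",
	"PyPI":      "pypi",
	"Go":        "go", // assumed spelling
	"RubyGems":  "rubygems",
	"Packagist": "packagist",
}

// convert keeps MAL- records only and emits one entry per enumerated version.
func convert(recs []Record) []Entry {
	var out []Entry
	for _, r := range recs {
		if !strings.HasPrefix(r.ID, "MAL-") {
			continue // not a malicious-package record
		}
		for _, a := range r.Affected {
			eco, ok := ecosystems[a.Ecosystem]
			if !ok || a.Name == "" {
				continue // unsupported ecosystem or unusable package name
			}
			// Range-only entries have no enumerated versions, so this loop
			// runs zero times and the entry is effectively skipped.
			for _, v := range a.Versions {
				out = append(out, Entry{ID: r.ID, Ecosystem: eco, Name: a.Name, Version: v})
			}
		}
	}
	return out
}

func main() {
	// Placeholder ids and package names, for illustration only.
	recs := []Record{
		{ID: "MAL-0000-0001", Affected: []Affected{{Ecosystem: "npm", Name: "evil-pkg", Versions: []string{"1.0.0", "1.0.1"}}}},
		{ID: "MAL-0000-0002", Affected: []Affected{{Ecosystem: "PyPI", Name: "range-only"}}},
		{ID: "CVE-0000-0001", Affected: []Affected{{Ecosystem: "npm", Name: "plain-vuln", Versions: []string{"2.0.0"}}}},
	}
	for _, e := range convert(recs) {
		fmt.Printf("%s %s %s@%s\n", e.ID, e.Ecosystem, e.Name, e.Version)
	}
}
```

Emitting one entry per enumerated version is what makes exact-version matching possible; it is also why range-only records cannot be represented in this sketch.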
…, scope

Local hardening pass on tools/osvcatalog and internal/osv:

- M1: defensive post-read length check in loadZip, mirroring the existing guard on plain .json reads (belt-and-suspenders against malformed archives; archive/zip normally errors first).
- M2: map OSV "VSCode" ecosystem to Bumblebee "editor-extension" (recovers 17 records from the OSSF malicious-packages corpus that were previously skipped under SkippedEcosystem).
- M3: hard-code severity="critical" on every emitted entry so the generated catalog matches the dominant severity used by hand-curated malicious-package catalogs under threat_intel/.
- L1: id-shape regex guard before embedding the OSV id in the Source URL; rejects whitespace, control characters, ?, #, %, @, and other out-of-set bytes (host is always osv.dev regardless).
- L3: skip-reason counts (no-versions, unsupported-ecosystem, withdrawn, not-malicious, bad-id) folded into the deterministic _comment for reproducibility audits.
- L5: optional -source flag stamps an upstream provenance label (e.g. "github.com/ossf/malicious-packages@<sha>") into _comment.

Scope tightening: dropped -include-vulns flag and Options.IncludeVulns; the importer emits malicious-package entries only, in line with the catalog format's purpose.

Robustness: readLimited helper treats -max-file-size <= 0 as "unbounded" (matching internal/exposure.LoadFile semantics) instead of silently truncating to 1 byte and dropping every record.

threat_intel/README.md: rewritten to document both supported input shapes side by side: the OSSF malicious-packages sparse-checkout (recommended, all ecosystems) and the per-ecosystem OSV all.zip alternative.

Tests added/updated:

- TestLoadZipPostReadSizeGuard renamed to TestLoadZipRejectsOversizedEntry with an honest comment about which guard it exercises.
- TestMaxFileSizeZeroIsUnbounded covers the new unbounded semantics.
- TestVSCodeMapsToEditorExtension, TestSkipBadIDShape, TestSeverityCritical, TestConvertDropsNonMaliciousVuln, TestBuildCatalogCommentDeterministic (extended), TestRunSourceFlag.

Verified on the full OSSF corpus (226,427 records → 22,365 entries; deterministic across reruns; npm 16,505 / pypi 4,873 / rubygems 969 / editor-extension 17 / packagist 1). Generated catalog and all existing threat_intel/ catalogs validate against docs/schema/v0.1.0/exposure-catalog.schema.json. End-to-end scanner test against a planted malicious package emits a package_exposure finding with severity:"critical" and a valid source URL.

Closes #21
…ame skip

- threat_intel/README.md: "roughly half" was wrong by ~9x. The range-only path drops the large majority (~90% of the current OSSF corpus); state the right number.
- internal/osv/osv.go: comment the empty-package-name fallthrough so it's not the one drop path that's silently uncounted in Stats.
- tools/osvcatalog/main_test.go: drop a stale review-milestone label from a test comment.
If Convert ever returned zero entries, the prior t.Fatalf would panic with index-out-of-range instead of failing cleanly. Print the whole entries slice (matching sibling tests in this file).
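The failure mode here is that the arguments to t.Fatalf are evaluated before the call, so indexing `entries[0]` inside them panics on an empty slice. A minimal sketch of the safer pattern, written as a plain function so it runs outside `go test`; the `Entry` type and `checkFirstSeverity` helper are hypothetical and only stand in for whatever the real test asserts:

```go
package main

import "fmt"

// Entry is an illustrative stand-in for a converted catalog entry.
type Entry struct {
	Name, Severity string
}

// checkFirstSeverity reports a mismatch without ever indexing an empty slice.
// The length check short-circuits before entries[0] is touched, and the
// message prints the whole slice instead of a single element.
func checkFirstSeverity(entries []Entry, want string) error {
	if len(entries) == 0 || entries[0].Severity != want {
		return fmt.Errorf("entries = %+v, want first entry with severity %q", entries, want)
	}
	return nil
}

func main() {
	fmt.Println(checkFirstSeverity(nil, "critical"))
	fmt.Println(checkFirstSeverity([]Entry{{Name: "evil-pkg", Severity: "critical"}}, "critical"))
}
```

In a test this would be used as `if err := checkFirstSeverity(entries, "critical"); err != nil { t.Fatal(err) }`, so an empty result fails cleanly instead of panicking.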
Adds tools/osvcatalog, an offline generator that converts a locally-downloaded OSV snapshot into a Bumblebee exposure catalog. The scanner never contacts osv.dev at scan time; this is a maintainer-side import that produces a static catalog reviewed and committed like any other under threat_intel/.

Only malicious-package records (MAL- ids, or records aliased to one) are emitted, with severity: "critical". OSV ecosystems (npm, PyPI, Go, RubyGems, Packagist, VSCode) map to Bumblebee's (with VSCode → editor-extension), using OSV's enumerated affected[].versions. Records whose only version information is a range (no enumerated versions) are skipped, since v0.1 matches exact versions only; this is documented as the main coverage limit.

Two input shapes are supported and documented in threat_intel/README.md:

- the OSSF malicious-packages repo via --sparse clone (recommended; all ecosystems in one tree);
- per-ecosystem OSV all.zip archives.

Conversion logic lives in internal/osv; the CLI under tools/ is thin. Output is deterministic, validates against the published exposure-catalog schema, and is consumed by bumblebee scan --exposure-catalog. Zero new dependencies.

Verified on the full OSSF corpus (226,427 records → 22,365 entries; byte-identical across reruns).
Closes #21