feat: store multiple repo licenses as array IN-1099 #4105

Open
gaspergrom wants to merge 2 commits into main from feat/IN-1099-display-repo-license-info

@gaspergrom gaspergrom commented May 12, 2026

Summary

  • Replace single license VARCHAR(255) column with licenses VARCHAR(255)[] array on public.repositories
  • Update git integration license_service.py to return all detected licenses (applying the 98% confidence threshold per-license; returns ["NOASSERTION"] when files exist but none pass, [] when nothing found)
  • Rename update_repository_license → update_repository_licenses in crud.py to write the full array
  • Update IRepository TypeScript type and both SELECT queries to use licenses
  • Add licenses Array(String) to Tinybird repositories.datasource (Sequin sync)
  • Produce repoLicenses Array(Tuple(String, String)) — flat (repoUrl, licenseId) pairs — in insights_projects_populated_copy.pipe via arrayFlatten(groupArray(arrayMap(...)))
  • Add repoLicenses to insights_projects_populated_ds.datasource and expose through insightsProjects_filtered.pipe
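For reviewers, the three return cases described for license_service.py can be sketched roughly as below. This is an illustration only, not the code in this PR: `select_licenses`, the `(spdx_id, confidence)` input shape, and the inclusive `>=` comparison are all assumptions made for the example.

```python
from typing import Iterable

CONFIDENCE_THRESHOLD = 98.0  # percent, taken from the summary above
NOASSERTION = "NOASSERTION"


def select_licenses(detections: Iterable[tuple[str, float]]) -> list[str]:
    """Hypothetical helper: pick the SPDX IDs that pass the threshold.

    `detections` is assumed to hold one (spdx_id, confidence) pair per
    license file that was found in the repository.
    """
    seen_any = False
    accepted: list[str] = []
    for spdx_id, confidence in detections:
        seen_any = True
        # The threshold is applied to each license independently.
        if confidence >= CONFIDENCE_THRESHOLD and spdx_id not in accepted:
            accepted.append(spdx_id)
    if accepted:
        return accepted
    # Files exist but none passed: NOASSERTION. Nothing found at all: empty list.
    return [NOASSERTION] if seen_any else []
```

Keeping `[]` and `["NOASSERTION"]` distinct lets downstream consumers tell "no license files" apart from "license files present but not confidently identified".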

Changes

| Area | Files |
| --- | --- |
| Migration | V1778600068__removeLicenseAddLicensesToRepositories.sql / U*.sql |
| Git integration | license_service.py, crud.py, repository_worker.py |
| TypeScript DAL | services/libs/data-access-layer/src/repositories/index.ts |
| Tinybird datasources | repositories.datasource, insights_projects_populated_ds.datasource |
| Tinybird pipes | insights_projects_populated_copy.pipe, insightsProjects_filtered.pipe |

Ticket

https://linuxfoundation.atlassian.net/browse/IN-1099


Note

Medium Risk
Medium risk due to a Postgres schema change and downstream contract updates (git worker writes, TS DAL reads, Tinybird schemas/pipes) that can break deployments if not migrated and synced in lockstep.

Overview
Stores multiple licenses per repository. Replaces public.repositories.license with licenses (VARCHAR[]) via forward/backward migrations.

Updates git integration license detection to return all SPDX IDs above the confidence threshold (otherwise [] or ["NOASSERTION"]), and writes the array via renamed update_repository_licenses.

Propagates the new licenses field through the TypeScript DAL selects/types and Tinybird ingestion/analytics by adding licenses to the repositories datasource and emitting/exposing repoLicenses (repo URL, license) tuples in populated project pipes.

Reviewed by Cursor Bugbot for commit c471e3b. Bugbot is set up for automated code reviews on this repo.

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
@gaspergrom gaspergrom self-assigned this May 12, 2026
Copilot AI review requested due to automatic review settings May 12, 2026 17:13
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI left a comment

Pull request overview

This PR upgrades repository license storage from a single SPDX string to a multi-value array and threads that change through the git integration ingestion path, the TypeScript DAL, and Tinybird analytics so multiple detected licenses can be persisted and queried.

Changes:

  • Replace public.repositories.license with public.repositories.licenses (varchar[]) and update git integration to write detected license arrays.
  • Update DAL repository type + queries to select licenses.
  • Extend Tinybird repository/project datasets and pipes to ingest licenses and expose flattened (repoUrl, licenseId) pairs as repoLicenses.
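To make the shape of repoLicenses concrete, here is a plain-Python analogue of the flattening step. It is not the pipe's SQL; `flatten_repo_licenses` and the example URLs are hypothetical, and treating a missing licenses value as an empty list is an assumption.

```python
from itertools import chain
from typing import Iterable, Optional


def flatten_repo_licenses(
    repos: Iterable[tuple[str, Optional[list[str]]]],
) -> list[tuple[str, str]]:
    """Hypothetical analogue of arrayFlatten(groupArray(arrayMap(...))).

    Each repo contributes one (repo_url, license_id) pair per license;
    repos with no licenses contribute nothing.
    """
    return list(
        chain.from_iterable(
            # Inner generator plays the role of arrayMap over one repo's licenses.
            ((url, license_id) for license_id in (licenses or []))
            for url, licenses in repos
        )
    )


pairs = flatten_repo_licenses(
    [
        ("https://example.com/org/a", ["MIT", "Apache-2.0"]),
        ("https://example.com/org/b", []),
    ]
)
```

A flat list of pairs keeps the repo-to-license association without needing a nested array per repository in the project row.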

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| services/libs/tinybird/pipes/insightsProjects_filtered.pipe | Exposes repoLicenses in the filtered insights projects output. |
| services/libs/tinybird/pipes/insights_projects_populated_copy.pipe | Pulls repo licenses and produces flattened repoLicenses tuples per project. |
| services/libs/tinybird/datasources/repositories.datasource | Adds licenses Array(String) to the repositories datasource schema. |
| services/libs/tinybird/datasources/insights_projects_populated_ds.datasource | Adds repoLicenses Array(Tuple(String, String)) to populated projects schema. |
| services/libs/data-access-layer/src/repositories/index.ts | Renames repository field to licenses and updates selects accordingly. |
| services/apps/git_integration/src/crowdgit/worker/repository_worker.py | Writes detected licenses array to Postgres during first-batch processing. |
| services/apps/git_integration/src/crowdgit/services/license/license_service.py | Returns a list of SPDX IDs (or [] / ['NOASSERTION']) instead of a single value. |
| services/apps/git_integration/src/crowdgit/database/crud.py | Renames and updates CRUD method to persist licenses array column. |
| backend/src/database/migrations/V1778600068__removeLicenseAddLicensesToRepositories.sql | Drops license column and adds licenses array column. |
| backend/src/database/migrations/U1778600068__removeLicenseAddLicensesToRepositories.sql | Rollback migration to drop licenses and re-add license. |

Comment on lines +1 to +2
ALTER TABLE public.repositories DROP COLUMN license;
ALTER TABLE public.repositories ADD COLUMN licenses VARCHAR(255)[];
Comment on lines +1 to +2
ALTER TABLE public.repositories DROP COLUMN licenses;
ALTER TABLE public.repositories ADD COLUMN license VARCHAR(255);

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

# licensee puts per-file confidence inside each matched_file's matcher object.
confidence_by_spdx: dict[str, float] = {}
for mf in matched_files:
    spdx = (mf.get("matched_license") or {}).get("spdx_id") or ""
Confidence map may never populate due to field access

Medium Severity

The code accesses (mf.get("matched_license") or {}).get("spdx_id") assuming matched_license is a dict. If the licensee gem serializes matched_license as a plain string (the SPDX ID directly, e.g. "MIT"), the or {} fallback won't trigger (since a non-empty string is truthy), and calling .get("spdx_id") on a string raises AttributeError. This would be caught by the outer except Exception on line 83, causing the function to always return [] — silently disabling license detection for all repositories. Even if matched_license is correctly a dict, the confidence_by_spdx map remains empty if the key differs between matched_files and licenses entries, causing all licenses to bypass confidence filtering.
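One defensive way to address this would be to accept both shapes when reading matched_license. The sketch below is a suggestion under stated assumptions, not a verified description of licensee's JSON: `extract_spdx_id` and `build_confidence_map` are hypothetical names, and the `matcher["confidence"]` key is inferred from the code comment in the diff.

```python
from typing import Any


def extract_spdx_id(matched_license: Any) -> str:
    """Hypothetical helper: accept either a plain SPDX string or a dict."""
    if isinstance(matched_license, str):
        return matched_license.strip()
    if isinstance(matched_license, dict):
        return str(matched_license.get("spdx_id") or "").strip()
    return ""  # None or any unexpected type


def build_confidence_map(matched_files: list[dict[str, Any]]) -> dict[str, float]:
    """Map each SPDX ID to the highest per-file confidence seen for it."""
    confidence_by_spdx: dict[str, float] = {}
    for mf in matched_files:
        spdx = extract_spdx_id(mf.get("matched_license"))
        matcher = mf.get("matcher")
        # Assumed key name; skip entries that do not carry a numeric confidence.
        confidence = matcher.get("confidence") if isinstance(matcher, dict) else None
        if not spdx or not isinstance(confidence, (int, float)):
            continue
        confidence_by_spdx[spdx] = max(
            confidence_by_spdx.get(spdx, 0.0), float(confidence)
        )
    return confidence_by_spdx


example = build_confidence_map(
    [
        {"matched_license": "MIT", "matcher": {"confidence": 100}},
        {"matched_license": {"spdx_id": "Apache-2.0"}, "matcher": {"confidence": 97.5}},
        {"matched_license": None, "matcher": {"confidence": 99}},
    ]
)
```

With this shape handling in place, an unexpected serialization no longer raises inside the loop, so a broad except would not silently turn every result into [].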

@gaspergrom gaspergrom requested a review from joanagmaia May 12, 2026 17:29
3 participants