Adapt Extraction of PyPI Metadata for PEP 639 Support #729

Open
yashkohli88 wants to merge 1 commit into clearlydefined:master from yashkohli88:yk/pypi-pep639-license-expression

@yashkohli88
Contributor

Description

Summary

This PR adds support for the PEP 639 license_expression field to the PyPI metadata extraction logic. PEP 639 introduces a standardized license_expression field that holds an SPDX license expression, and it deprecates the legacy free-form license text and the license classifiers.

Problem

PyPI packages are transitioning to PEP 639, which introduces info.license_expression containing proper SPDX license expressions. The crawler currently does not recognize this new field, so it can miss accurate license data from packages that have already adopted the new standard.

Examples:

  • Legacy format: requests — uses info.license free-form text
  • PEP 639 format: packaging — uses info.license_expression SPDX expression
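To make the two formats concrete, here is a minimal sketch of what the relevant part of the registry data might look like in each case. The package names, field values, and the describeLicenseSource helper are all hypothetical and only illustrate the shape of info; they are not taken from the real responses for requests or packaging.

```javascript
// Hypothetical, trimmed-down registry responses. Values are illustrative only.
const legacyRegistryData = {
  info: {
    name: 'legacy-example',
    license: 'Apache 2.0', // free-form text, not guaranteed to be valid SPDX
    license_expression: null,
    classifiers: ['License :: OSI Approved :: Apache Software License']
  }
}

const pep639RegistryData = {
  info: {
    name: 'pep639-example',
    license: null,
    license_expression: 'Apache-2.0 OR BSD-2-Clause', // SPDX expression
    classifiers: []
  }
}

// Hypothetical helper: reports which field carries the license data.
function describeLicenseSource(registryData) {
  const info = (registryData && registryData.info) || {}
  const isNonEmptyString = value => typeof value === 'string' && value.trim() !== ''
  if (isNonEmptyString(info.license_expression)) return 'license_expression'
  if (isNonEmptyString(info.license)) return 'license'
  return 'none'
}

console.log(describeLicenseSource(legacyRegistryData))
console.log(describeLicenseSource(pep639RegistryData))
```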

Solution

Extends the license extraction logic to prioritize the new license_expression field while maintaining full backward compatibility with existing detection.


Changes

providers/fetch/pypiFetch.js

  • Added _extractLicenseExpression(registryData) — New method to extract and validate info.license_expression from PyPI registry data. Returns the expression string if valid, otherwise null.

  • Updated _extractDeclaredLicense(registryData) — Now checks for license_expression first via _extractLicenseExpression(), falling back to existing logic (free-form info.license → license classifiers → SPDX normalization) when unavailable.
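The priority order described above can be sketched roughly as follows. This is not the code from the PR: the functions are written as standalone functions rather than methods, extractLegacyLicense is a hypothetical, simplified stand-in for the existing fallback logic (it skips SPDX normalization entirely), and treating a whitespace-only expression as invalid is an assumption.

```javascript
// Returns info.license_expression when it is a non-empty string, otherwise null.
// Assumption: "valid" here only means "non-empty string"; no SPDX parsing is done.
function extractLicenseExpression(registryData) {
  const info = (registryData && registryData.info) || {}
  const expression = info.license_expression
  if (typeof expression !== 'string') return null
  return expression.trim() === '' ? null : expression
}

// Hypothetical stand-in for the existing logic: free-form info.license first,
// then the last segment of the first "License ::" classifier.
function extractLegacyLicense(registryData) {
  const info = (registryData && registryData.info) || {}
  if (typeof info.license === 'string' && info.license.trim() !== '') {
    return info.license.trim()
  }
  const classifiers = Array.isArray(info.classifiers) ? info.classifiers : []
  const licenseClassifier = classifiers.find(
    classifier => typeof classifier === 'string' && classifier.startsWith('License ::')
  )
  return licenseClassifier ? licenseClassifier.split('::').pop().trim() : null
}

// license_expression wins when present; everything else falls through.
function extractDeclaredLicense(registryData) {
  return extractLicenseExpression(registryData) || extractLegacyLicense(registryData)
}
```

Keeping the new check in its own function means the fallback path stays untouched, which is what lets packages without license_expression behave exactly as before.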

Test Summary

| Test Suite | Test Case | Expected |
|---|---|---|
| extractLicenseExpression | license_expression is present ('MIT') | Returns 'MIT' |
| extractLicenseExpression | Compound SPDX expression ('MIT AND Apache-2.0') | Returns as-is |
| extractLicenseExpression | Expression with WITH clause | Returns as-is |
| extractLicenseExpression | license_expression is missing | Returns null |
| extractLicenseExpression | license_expression is null | Returns null |
| extractLicenseExpression | license_expression is empty string | Returns null |
| extractLicenseExpression | license_expression is not a string (123) | Returns null |
| extractDeclaredLicense | Prioritizes license_expression over info.license | Uses license_expression |
| extractDeclaredLicense | Prioritizes license_expression over classifiers | Uses license_expression |
| extractDeclaredLicense | Falls back to info.license when license_expression missing | Uses info.license |
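The extractLicenseExpression cases in the test summary could be expressed as a small table-driven check. The sketch below uses Node's built-in assert module rather than the project's actual test framework, defines its own stand-in for the function under test, and picks 'GPL-2.0-only WITH Classpath-exception-2.0' as the WITH example; all three are assumptions made for illustration.

```javascript
// Stand-in for the method under test, with the behavior the test summary describes.
function extractLicenseExpression(registryData) {
  const info = (registryData && registryData.info) || {}
  const expression = info.license_expression
  return typeof expression === 'string' && expression.trim() !== '' ? expression : null
}

// One entry per extractLicenseExpression row of the test summary.
const licenseExpressionCases = [
  { name: 'present', data: { info: { license_expression: 'MIT' } }, expected: 'MIT' },
  {
    name: 'compound',
    data: { info: { license_expression: 'MIT AND Apache-2.0' } },
    expected: 'MIT AND Apache-2.0'
  },
  {
    name: 'WITH clause',
    data: { info: { license_expression: 'GPL-2.0-only WITH Classpath-exception-2.0' } },
    expected: 'GPL-2.0-only WITH Classpath-exception-2.0'
  },
  { name: 'missing', data: { info: {} }, expected: null },
  { name: 'null', data: { info: { license_expression: null } }, expected: null },
  { name: 'empty string', data: { info: { license_expression: '' } }, expected: null },
  { name: 'not a string', data: { info: { license_expression: 123 } }, expected: null }
]

// Runs every case and returns how many were checked; throws on the first mismatch.
function runLicenseExpressionCases() {
  const assert = require('assert')
  for (const { name, data, expected } of licenseExpressionCases) {
    assert.strictEqual(extractLicenseExpression(data), expected, name)
  }
  return licenseExpressionCases.length
}

console.log(`${runLicenseExpressionCases()} cases checked`)
```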

Backward Compatibility

| Scenario | Behavior |
|---|---|
| Package has license_expression (PEP 639) | ✅ Uses license_expression directly |
| Package has only info.license (legacy) | ✅ Falls back to existing extraction logic |
| Package has only license classifiers | ✅ Falls back to classifier-based extraction |
| Package has both old and new fields | ✅ Prioritizes license_expression |
| license_expression is empty/null/invalid | ✅ Falls back gracefully |

Testing

  • Unit tests for _extractLicenseExpression() with valid SPDX expressions
  • Unit tests for _extractLicenseExpression() with null/empty/invalid values
  • Unit tests for compound expressions (AND, OR, WITH)
  • Integration tests for priority logic in _extractDeclaredLicense()
  • Regression — all existing test cases continue to pass

Related
