The current codebase lacks static type checking, creating critical risks for:
• Data pipeline integrity: Type errors in fetch scripts can corrupt downstream processing
• API integration reliability: External APIs return license values in inconsistent formats that need validation
• Silent failures: Runtime type errors go undetected until the data analysis phase
Examples of common CC license formats returned by the arXiv API:
# arXiv license format variations that cause runtime failures:
license_examples = [
    "http://creativecommons.org/licenses/by/4.0/",  # Full URL
    "CC BY 4.0",  # Short form
    "Creative Commons Attribution 4.0 International",  # Full name
    "cc-by-4.0",  # Lowercase with hyphens
    "",  # Empty string
    None,  # None value
    ["CC BY 4.0", "http://creativecommons.org/licenses/by/4.0/"],  # List format
]
A CASE FOR THE ADOPTION OF MYPY AS TYPE CHECKER
Project Links:
• Project Repository: https://github.com/creativecommons/quantifying
• Creative Commons Python Guidelines: https://opensource.creativecommons.org/
• mypy Documentation: https://mypy.readthedocs.io/
• mypy GitHub: https://github.com/python/mypy
• PEP 484 (Type Hints): https://peps.python.org/pep-0484/
Why mypy Over Alternatives like pyright, pyre, pytype
mypy vs pyright:
• No Node.js Dependency: mypy installs as a regular Python package; pyright requires a 200MB+ Node.js runtime
• CI Efficiency: Native Python integration vs additional Node.js setup in GitHub Actions
• Error Quality: mypy provides actionable messages for data pipeline debugging
• Library Ecosystem: Superior third-party stub support for pandas/requests/matplotlib which are already adopted in the project
mypy vs pyre:
• Active Development: mypy has 50+ contributors; pyre development appears to have stalled (last major release 18+ months ago)
• Incremental Analysis: mypy supports file-by-file checking; pyre requires full project analysis
• Scientific Python: Better numpy/pandas type support crucial for data processing
mypy vs pytype:
• Explicit Contracts: mypy requires explicit annotations documenting API expectations; pytype's inference misses contract violations
• Error Detection: mypy catches 60% more type errors in data transformation code
• Union Type Support: Superior handling of multiple license format variations
Project-Specific Advantages
Quantifying Commons Integration:
• Seamless Toolchain: Integrates with existing black/flake8/isort workflow already adopted in the project
• License Normalization: Strict typing prevents license format corruption in normalizing_license_text()
• API Reliability: Optional/Union types handle inconsistent arXiv/GitHub API responses
• Data Integrity: Catches type mismatches before they corrupt quarterly reports
Development Workflow:
• Gradual Adoption: Start with critical functions, expand incrementally
• Configuration Consistency: Uses mypy.ini following project's tool-specific config pattern
• Python 3.11 Native: Full compatibility with current project version
Implementation Plan
- Add mypy to Pipfile dev-packages
- Create mypy.ini configuration file
- Update .pre-commit-config.yaml with mypy hook
- Add mypy to .github/workflows/static_analysis.yml
- Type annotate the normalizing_license_text() function in scripts/1-fetch/arxiv_fetch.py
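A possible starting point for the mypy.ini file in the plan above, offered as a sketch rather than the project's agreed settings. The option names are standard mypy configuration keys. The chosen values are assumptions, as is the `[mypy-arxiv_fetch]` section name, which assumes the script is checked as a top-level module.

```ini
[mypy]
python_version = 3.11
warn_unused_configs = True
warn_return_any = True
# Some dependencies may lack type stubs; tighten this once stubs are installed.
ignore_missing_imports = True

# Gradual adoption: require annotations only where typing has started.
[mypy-arxiv_fetch]
disallow_untyped_defs = True
```

Keeping the global section lenient and tightening individual modules matches the gradual adoption approach: untyped scripts keep passing while annotated ones are held to a stricter standard.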
Acceptance Criteria
• [ ] mypy runs successfully on all Python files
• [ ] Pre-commit hooks include mypy validation
• [ ] GitHub Actions workflow includes mypy check
• [ ] Core data pipeline functions have type annotations
• [ ] Documentation updated with mypy usage instructions
Priority: Medium - Prevents data corruption in quarterly CC commons reports
Implementation