63 changes: 42 additions & 21 deletions docs/source/explanation/scancode-license-detection.rst
@@ -1,33 +1,33 @@
.. _scancode-license-detection:

ScanCode license detection
ScanCode License Detection
==========================

For license detection, ScanCode uses a (large) number of license texts and license detection
'rules' that are compiled in a search index. When scanning, the text of the target file is
extracted and used to query the license search index and find license matches.
For license detection, ScanCode uses a large number of license texts and license detection
rules that are compiled into a search index. When scanning, the text of the target file is
extracted and used to query the license search index to find license matches.
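The idea of compiling rule texts into an index and querying it with file text can be sketched with a toy inverted index. This is only an illustration under simplifying assumptions: the rule texts are hypothetical and heavily abbreviated, the names ``RULES``, ``build_index``, and ``query`` are invented for this sketch, and ScanCode's actual index and matching strategies are considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical, heavily abbreviated rule texts (real license texts are much longer).
RULES = {
    "mit": "permission is hereby granted free of charge to any person",
    "apache-2.0": "licensed under the apache license version 2.0",
}


def build_index(rules):
    """Map each token to the set of rule ids whose text contains it."""
    index = defaultdict(set)
    for rule_id, text in rules.items():
        for token in text.lower().split():
            index[token].add(rule_id)
    return index


def query(index, rules, text):
    """Rank rules by the fraction of their distinct tokens found in the text."""
    query_tokens = set(text.lower().split())
    hits = defaultdict(int)
    for token in query_tokens:
        for rule_id in index.get(token, ()):
            hits[rule_id] += 1
    scores = {
        rule_id: count / len(set(rules[rule_id].lower().split()))
        for rule_id, count in hits.items()
    }
    # Highest-scoring rule first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_index(RULES)
results = query(index, RULES, "Licensed under the Apache License, Version 2.0")
print(results)
```

Note that the token ``license,`` in the query keeps its trailing comma here and so does not match ``license`` in the rule; this sketch deliberately skips any cleanup of the query text.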

For copyright detection, ScanCode uses a grammar that defines the most common and less common
forms of copyright statements. When scanning, the target file text is extracted and 'parsed'
with this grammar to extract copyright statements.
forms of copyright statements. When scanning, the target file text is extracted and parsed
using this grammar to identify copyright statements.

ScanCode-Toolkit performs the scan on a codebase in the following steps :
ScanCode Toolkit performs the scan on a codebase in the following steps:

1. Collect an inventory of the code files and classify the code using file types,
2. Extract files from any archive using a general purpose extractor
3. Extract texts from binary files if needed
4. Use an extensible rules engine to detect open source license text and notices
5. Use a specialized parser to capture copyright statements
6. Identify packaged code and collect metadata from packages
7. Report the results in the formats of your choice (JSON, CSV, etc.) for integration
with other tools
1. Collect an inventory of the code files and classify them using file types.
2. Extract files from any archive using a general-purpose extractor.
3. Extract text from binary files if needed.
4. Use an extensible rules engine to detect open source license text and notices.
5. Use a specialized parser to capture copyright statements.
6. Identify packaged code and collect metadata from packages.
7. Report the results in the format of your choice (JSON, CSV, etc.) for integration
with other tools.

Scan results are provided in various formats:

* a JSON file simple or pretty-printed,
* SPDX tag value or XML, RDF formats,
* JSON (simple or pretty-printed),
* SPDX tag-value, XML, or RDF formats,
* CSV,
* a simple unformatted HTML file that can be opened in browser or as a spreadsheet.
* a simple unformatted HTML file that can be opened in a browser or as a spreadsheet.

For each scanned file, the result contains:

@@ -37,7 +37,28 @@ For each scanned file, the result contains:
scanned file, and
* reference information for the detected license.

For archive extraction, ScanCode uses a combination of Python modules, 7zip and libarchive/bsdtar
to detect archive types and extract these recursively.
Text Normalization During License Matching
------------------------------------------

Several other utility modules are used such as libmagic for file and mime type detection.
Before comparing file text against the license search index, ScanCode applies basic
normalization to improve matching accuracy. This ensures that minor textual variations
do not prevent a valid license from being detected.

Normalization includes the following:

* **Whitespace handling**: Extra spaces, tabs, and line breaks are normalized so that
differences in formatting do not affect the matching result.

* **Punctuation handling**: Minor variations in punctuation — such as differences in
the use of commas, periods, or hyphens — are accounted for during matching, so that
slightly reformatted license text can still be recognized correctly.

* **Case normalization**: Text is compared in a case-insensitive manner, so differences
in capitalization do not prevent a match.

These normalization steps make license detection more robust across real-world codebases,
where license text may appear in varying styles and formats.
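
As a rough illustration of the three normalization steps listed above, the following minimal sketch lowercases text, replaces punctuation with spaces, and splits on any run of whitespace. The ``normalize`` function is a hypothetical helper written for this example; it is not ScanCode's actual tokenizer, which may treat specific characters differently.

```python
import re


def normalize(text: str) -> list:
    """Return a list of lowercase word tokens, ignoring punctuation and spacing."""
    # Case normalization: compare everything in lowercase.
    lowered = text.lower()
    # Punctuation handling: replace anything that is not a word character
    # or whitespace with a space (assumption: punctuation carries no meaning).
    without_punctuation = re.sub(r"[^\w\s]", " ", lowered)
    # Whitespace handling: split() with no arguments collapses runs of
    # spaces, tabs, and line breaks.
    return without_punctuation.split()


original = "Permission is hereby granted, free of charge,\n\tto any person"
reformatted = "PERMISSION IS HEREBY GRANTED free of charge to any   person"

# Both variants reduce to the same token sequence.
assert normalize(original) == normalize(reformatted)
```

With a helper of this kind, two copies of a license notice that differ only in capitalization, punctuation, or line wrapping produce identical token sequences and can be matched against the same rule.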

For archive extraction, ScanCode uses a combination of Python modules, 7zip, and
libarchive/bsdtar to detect archive types and extract them recursively.
Several other utility modules are used, such as libmagic for file and MIME type detection.