Avoid false copyright detection on unicode or non-meaningful text by kushal-p16 · Pull Request #4855 · aboutcode-org/scancode-toolkit

kushal-p16 · 2026-03-20T15:17:08Z

Summary

This PR reduces false positives in copyright detection caused by unicode or non-meaningful text.

Problem

In certain files (e.g., Unicode or binary-like data), non-readable text was incorrectly interpreted as copyright content. For example, blocks of Unicode characters were parsed and produced output such as:

copyright: (c) $?i (c) Y

This creates noise and reduces the accuracy of detection.

Solution

This PR adds a lightweight filtering step to the tokenization stage (get_tokens()) that skips any line containing no alphabetic characters.

This ensures that:

Non-readable or unicode-heavy lines are ignored early
Only meaningful textual content is processed
Existing valid detections are unaffected
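
For illustration, the filter described above could look roughly like the sketch below. This is not the actual diff from this PR: the helper names has_alphabetic() and skip_non_alphabetic_lines() are hypothetical, and the sketch assumes the tokenizer receives an iterable of (line_number, line) pairs.

```python
def has_alphabetic(line):
    """Return True if the line contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in line)


def skip_non_alphabetic_lines(numbered_lines):
    """Yield only the (line_number, line) pairs whose text contains a letter.

    Hypothetical helper: assumes input is an iterable of (int, str) pairs.
    """
    for line_number, line in numbered_lines:
        if has_alphabetic(line):
            yield line_number, line


sample = [
    (1, "Copyright (c) 2020 Example Corp."),
    (2, "\x00\x01 $?# 1234 @@"),  # no letters: skipped
    (3, "(c) $?i (c) Y"),         # contains letters: still kept
]
kept = list(skip_non_alphabetic_lines(sample))
```

Two caveats with a check of this kind: str.isalpha() is Unicode-aware, so lines made up of non-Latin letters still pass, and any noisy line that happens to contain a single letter is also kept. A stricter heuristic (for example, a minimum ratio of letters to other characters) may be needed for such cases.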

Impact

Improves accuracy of copyright detection
Reduces noise from binary/unicode data
Keeps implementation simple and efficient

Fixes #4381

…ght detection

Skip non-meaningful unicode lines to reduce false positives in copyright detection

Some unicode or binary-like text was incorrectly being processed during tokenization, leading to false copyright detections.

This change adds a lightweight filter in get_tokens() to skip lines that do not contain any alphabetic characters. This prevents non-readable or unicode-heavy content from being parsed, while keeping existing valid detections unaffected.

Fixes aboutcode-org#4381

Signed-off-by: KUSHAL P <kushalmys55@gmail.com>

mstykow mentioned this pull request Apr 2, 2026

Reproduce BusyBox copyright corruption and adjacent noisy-text false positives mstykow/provenant#544

Closed

kushal-p16 wants to merge 1 commit into aboutcode-org:develop from kushal-p16:patch-4
