Avoid false copyright detection on unicode or non-meaningful text#4855

Open
kushal-p16 wants to merge 1 commit into aboutcode-org:develop from kushal-p16:patch-4

Summary

This PR reduces false positives in copyright detection caused by unicode or non-meaningful text.

Problem

In certain files (e.g., unicode or binary-like data), unreadable text was incorrectly interpreted as copyright content. For example, blocks of unicode characters were parsed and produced output such as:

copyright: (c) $?i (c) Y

This creates noise and reduces the accuracy of detection.

Solution

This PR adds a lightweight filter to the tokenization stage (get_tokens()) that skips any line containing no alphabetic characters.

This ensures that:

  • Non-readable or unicode-heavy lines are ignored early
  • Only meaningful textual content is processed
  • Existing valid detections are unaffected
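
As a rough illustration of the kind of check described above, here is a minimal standalone sketch. It is not the diff from this PR: the helper names (has_alphabetic, filter_numbered_lines), the ascii_only flag, and the assumption that lines arrive as (line number, text) pairs are all hypothetical and chosen only for the example.

```python
from typing import Iterable, Iterator, Tuple


def has_alphabetic(line: str, ascii_only: bool = False) -> bool:
    """Return True if the line contains at least one alphabetic character.

    Hypothetical helper, not part of the PR. With ascii_only=False this uses
    str.isalpha(), which also accepts non-ASCII letters.
    """
    if ascii_only:
        return any(ch.isascii() and ch.isalpha() for ch in line)
    return any(ch.isalpha() for ch in line)


def filter_numbered_lines(
    numbered_lines: Iterable[Tuple[int, str]],
    ascii_only: bool = False,
) -> Iterator[Tuple[int, str]]:
    """Yield only the (line_number, text) pairs that contain a letter.

    The (line_number, text) input shape is an assumption for this sketch.
    """
    for line_number, line in numbered_lines:
        if has_alphabetic(line, ascii_only=ascii_only):
            yield line_number, line


if __name__ == "__main__":
    sample = [
        (1, "Copyright (c) 2024 Example Corp"),
        (2, "$?# 1234 ()"),
        (3, ""),
        (4, "日本語"),
    ]
    for number, text in filter_numbered_lines(sample):
        print(number, text)
```

Two caveats may be worth checking against the real change. Python's str.isalpha() returns True for non-ASCII letters, so a line made up of such letters still passes the default check. A line that contains even a single letter is also kept, so symbol-heavy lines with one or two stray letters would not be filtered by a check of this shape.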

Impact

  • Improves accuracy of copyright detection
  • Reduces noise from binary/unicode data
  • Keeps implementation simple and efficient

Fixes #4381

Skip non-meaningful unicode lines to reduce false positives in copyright detection

Some unicode or binary-like text was incorrectly being processed during tokenization,
leading to false copyright detections.

This change adds a lightweight filter in get_tokens() to skip lines that do not
contain any alphabetic characters. This prevents non-readable or unicode-heavy
content from being parsed, while keeping existing valid detections unaffected.

Fixes aboutcode-org#4381

Signed-off-by: KUSHAL P <kushalmys55@gmail.com>

Development

Successfully merging this pull request may close these issues.

Potential noise from copyright detection