Avoid false copyright detection on unicode or non-meaningful text #4855
Open
kushal-p16 wants to merge 1 commit into aboutcode-org:develop
…ght detection

Skip non-meaningful unicode lines to reduce false positives in copyright detection

Some unicode or binary-like text was incorrectly being processed during tokenization, leading to false copyright detections.

This change adds a lightweight filter in get_tokens() to skip lines that do not contain any alphabetic characters. This prevents non-readable or unicode-heavy content from being parsed, while keeping existing valid detections unaffected.

Fixes aboutcode-org#4381

Signed-off-by: KUSHAL P <kushalmys55@gmail.com>
Summary
This PR reduces false positives in copyright detection caused by unicode or non-meaningful text.
Problem
In certain files (e.g., unicode or binary-like data), non-readable text was incorrectly interpreted as copyright content. For example, blocks of unicode characters were being parsed and resulted in outputs such as:
This creates noise and reduces the accuracy of detection.
Solution
A lightweight filtering step has been added in the tokenization stage (get_tokens()), which skips lines that do not contain any alphabetic characters. This ensures that:
Impact
Fixes #4381
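
For reviewers, the filtering idea described under Solution can be sketched roughly as follows. This is a minimal standalone illustration, not the actual diff: the helper names has_alphabetic() and skip_non_alphabetic_lines(), and the (line_number, text) input shape, are assumptions made for this example and may not match how get_tokens() receives its lines.

```python
from typing import Iterable, Iterator, Tuple


def has_alphabetic(line: str) -> bool:
    """Return True if the line contains at least one alphabetic character.

    Hypothetical helper for illustration only.
    """
    return any(ch.isalpha() for ch in line)


def skip_non_alphabetic_lines(
    numbered_lines: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str]]:
    """Yield only the (line_number, text) pairs whose text has a letter.

    Hypothetical pre-filter; the input shape is an assumption.
    """
    for line_number, line in numbered_lines:
        if has_alphabetic(line):
            yield line_number, line


sample = [
    (1, "Copyright (c) 2020 Example Corp."),
    (2, "0101 #### 1234 ----"),
    (3, "\u2500\u2500\u2500\u2500"),  # box-drawing characters, no letters
    (4, ""),
]

# Only the first line contains letters, so only it survives the filter.
kept = list(skip_non_alphabetic_lines(sample))
print(kept)
```

One design point worth checking in review: Python's str.isalpha() is Unicode-aware, so a line made up of letters from any script (for example CJK or accented characters) still passes this check. A filter of this shape only drops lines that consist entirely of digits, punctuation, symbols, or whitespace.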