Skip to content

🔧 Remove Em Dash from unicode check#74

Merged
dragonfire1119 merged 1 commit into
masterfrom
remove-em-dash-unicode-check
Mar 15, 2026
Merged

🔧 Remove Em Dash from unicode check#74
dragonfire1119 merged 1 commit into
masterfrom
remove-em-dash-unicode-check

Conversation

@dragonfire1119
Copy link
Copy Markdown
Contributor

@dragonfire1119 dragonfire1119 commented Mar 15, 2026

  • Remove Em Dash (U+2014) entry from unicode character detection list
  • Eliminates redundancy with existing dash detection rules

Greptile Summary

This PR removes the Em Dash (U+2014) entry from the harmful_patterns array in the Unicode Security Scanner. While the PR description states this "eliminates redundancy with existing dash detection rules," the is_common_unicode() function referenced is actually a skip-filter (used with --exclude-common), not a detection mechanism. Removing U+2014 from harmful_patterns means Em Dash will no longer be detected at all.

  • Removes the single line "e28094:2014:Em Dash (looks like double hyphen)" from the harmful_patterns array
  • Other dashes (U+2010–U+2013, U+2015) remain in the detection list
  • The comment on line 199 still references "em-dash" in the is_common_unicode() function, which is now unreachable for U+2014 since it's no longer in harmful_patterns
  • If intentionally removing Em Dash detection (e.g., too many false positives in prose), the rationale should be clarified

Confidence Score: 3/5

  • Low-risk single-line removal, but the stated rationale is incorrect — this fully removes Em Dash detection rather than deduplicating it.
  • The change is minimal and won't break anything, but it silently drops detection coverage for a confusable Unicode character (Em Dash). The PR description's claim of "redundancy" is factually wrong — the is_common_unicode() function is a filter, not a detection rule. This warrants author clarification before merging.
  • check-for-unicode/run.sh — verify whether dropping Em Dash detection is the intended behavior

Important Files Changed

Filename Overview
check-for-unicode/run.sh Removes Em Dash (U+2014) from harmful_patterns detection list. The stated rationale of redundancy with existing dash rules is incorrect — is_common_unicode() is a skip-filter, not a detection mechanism. This change fully removes Em Dash detection.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Iterate harmful_patterns] --> B{Is character in allowlist?}
    B -- Yes --> C[Skip]
    B -- No --> D{--exclude-emojis && is_emoji?}
    D -- Yes --> C
    D -- No --> E{--exclude-common && is_common_unicode?}
    E -- Yes --> C
    E -- No --> F[Search file for hex pattern]
    F --> G{Pattern found?}
    G -- Yes --> H[Report finding]
    G -- No --> C

    style A fill:#f9f,stroke:#333
    style E fill:#ff9,stroke:#333
    style H fill:#f66,stroke:#333

    subgraph "Em Dash After This PR"
        I["U+2014 NOT in harmful_patterns"] --> J["Never enters detection loop"]
        J --> K["Never detected regardless of flags"]
    end
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: check-for-unicode/run.sh
Line: 521-522

Comment:
**Em Dash no longer detected at all**

The PR description says this "eliminates redundancy with existing dash detection rules," but this appears to be incorrect. The `is_common_unicode()` function (line 199) does match U+2014 via the `^201[0-5]$` regex, but that function is only a **skip filter** — it tells the scanner to *ignore* common Unicode when `--exclude-common` is passed. It is not a detection rule.

The `harmful_patterns` array is the **only** detection mechanism. By removing U+2014 from it, the Em Dash character will never be detected by this scanner, regardless of flags. The other dash entries (U+2010–U+2013, U+2015) are still present, so only Em Dash specifically loses coverage.

If the intent is to stop flagging Em Dash entirely (e.g., because it's common in prose), this is fine — but the PR description's stated rationale of "redundancy" doesn't hold. Could you clarify whether the goal is to intentionally stop detecting Em Dash?

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 56cc66a

Greptile also left 1 inline comment on this PR.

Remove the Em Dash (U+2014) entry from the unicode
character detection list as it is redundant with
existing dash detection rules.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 15, 2026

Walkthrough

Removed the Em Dash (e28094:2014) pattern from the harmful Unicode patterns detection list in a single file. This prevents Em Dash characters from being flagged as harmful threats in Unicode validation checks while preserving all other detection patterns and functionality.

Changes

Cohort / File(s) Summary
Unicode Pattern Detection
check-for-unicode/run.sh
Removed Em Dash (e28094:2014:Em Dash) entry from the harmful_patterns list, disabling detection of Em Dash as a harmful Unicode pattern.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A dash removed with gentle care, ❌➖
No longer feared by Unicode's stare,
The Em Dash hops away so free,
From the harmful list's scrutiny! 🐰✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description is related to the changeset but contains a factually incorrect rationale about the code's behavior. Clarify whether the intent is to intentionally stop detecting Em Dash entirely. The stated rationale about 'redundancy with existing dash detection rules' is inaccurate—removing U+2014 from harmful_patterns fully removes detection, not deduplication.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title references removing Em Dash from unicode check, which aligns with the main change of removing the Em Dash pattern from the harmful_patterns list.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch remove-em-dash-unicode-check
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread check-for-unicode/run.sh
@dragonfire1119 dragonfire1119 merged commit 8317698 into master Mar 15, 2026
6 checks passed
@dragonfire1119 dragonfire1119 deleted the remove-em-dash-unicode-check branch March 15, 2026 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant