feat(search): add unicode character removal for fuzzy matching #4360
4yinn wants to merge 3 commits into Flow-Launcher:dev
Introduced a string preprocessing step in FuzzySearch that normalizes Unicode text and strips diacritics (combining marks). This improves the search
experience by allowing users to find results regardless of accents or special formatting.
📝 Walkthrough: accent-insensitive matching was added.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (1)
Flow.Launcher.Infrastructure/StringMatcher.cs (1)
66-73: Consider adding an option to toggle diacritics-insensitive matching.
The linked issue #4149 specifically requests an option to enable/disable diacritics-insensitive matching, similar to the Everything app. The current implementation always removes accents with no way to opt out.
If this is an intentional scope reduction, consider documenting it. Otherwise, you could add a setting flag:
if (_settings.IgnoreDiacritics)
{
    query = RemoveAccents(query);
    stringToCompare = RemoveAccents(stringToCompare);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Flow.Launcher.Infrastructure/StringMatcher.cs` around lines 66 - 73, FuzzyMatch currently always calls RemoveAccents on query and stringToCompare which forces diacritics-insensitive matching; add a configurable toggle (e.g. a boolean setting like _settings.IgnoreDiacritics) and only call RemoveAccents when that flag is true, updating FuzzyMatch's logic and any settings class to expose the option so consumers can enable/disable diacritics-insensitive matching; reference the RemoveAccents and FuzzyMatch methods and the MatchOption usage when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Flow.Launcher.Infrastructure/StringMatcher.cs`:
- Around line 249-250: In StringMatcher.cs (inside the method using
Char.GetUnicodeCategory), rename the local variable unicodedCategory to
unicodeCategory to fix the typo and consistency, and also replace the type alias
use from Char.GetUnicodeCategory(c) to char.GetUnicodeCategory(c) for C# keyword
consistency; update any subsequent references to the variable name
(unicodeCategory) accordingly.
- Around line 72-73: The matching indices are computed on the post-RemoveAccents
normalized string but applied to the original input, causing mismatches when
combining marks change string length; update the flow in the method that calls
RemoveAccents to 1) keep the original stringToCompare (save
originalStringToCompare before normalization), 2) produce an index mapping from
normalized-to-original positions when RemoveAccents transforms the string
(similar to the existing TranslationMapping for alphabet translation), and 3)
before returning, remap all indices in indexList and any positions inside
acronymMatchData using that accent-removal mapping (in addition to the existing
TranslationMapping) so returned indices align with the original string. Ensure
the mapping logic is used wherever indices are translated back (same spot where
TranslationMapping is applied).
📒 Files selected for processing (1)
Flow.Launcher.Infrastructure/StringMatcher.cs
query = RemoveAccents(query);
stringToCompare = RemoveAccents(stringToCompare);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for MatchData usage to understand the highlighting mechanism
rg -n -C3 'MatchData' --type=cs
Repository: Flow-Launcher/Flow.Launcher
Length of output: 12041
🏁 Script executed:
#!/bin/bash
# Find the context around indexList in StringMatcher.cs
rg -n 'indexList' Flow.Launcher.Infrastructure/StringMatcher.cs -B2 -A2
Repository: Flow-Launcher/Flow.Launcher
Length of output: 1657
🏁 Script executed:
#!/bin/bash
# Find where MatchResult is constructed in StringMatcher.cs to see all return paths
rg -n 'new MatchResult\|return new' Flow.Launcher.Infrastructure/StringMatcher.cs -B1 -A1
Repository: Flow-Launcher/Flow.Launcher
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Get the full context of StringMatcher.cs around lines 60-90 to understand the flow
sed -n '60,90p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 2168
🏁 Script executed:
#!/bin/bash
# Get context around line 79 where translationMapping is mentioned
sed -n '70,150p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 4371
🏁 Script executed:
#!/bin/bash
# Check how FuzzyMatch is called and what string is highlighted
# Look at a specific plugin example
sed -n '100,110p' Plugins/Flow.Launcher.Plugin.Explorer/Search/ResultManager.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 703
🏁 Script executed:
#!/bin/bash
# Check the Result model to understand how TitleHighlightData is used
rg -n 'TitleHighlightData' Flow.Launcher.Plugin/SharedModels/ -B2 -A2
Repository: Flow-Launcher/Flow.Launcher
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Search for where TitleHighlightData is actually applied (UI rendering)
rg -n 'TitleHighlightData' --type=cs | grep -v 'test\|Test' | head -20
Repository: Flow-Launcher/Flow.Launcher
Length of output: 1514
🏁 Script executed:
#!/bin/bash
# Find RemoveAccents method implementation
rg -n 'private.*RemoveAccents|public.*RemoveAccents' Flow.Launcher.Infrastructure/StringMatcher.cs -A10
Repository: Flow-Launcher/Flow.Launcher
Length of output: 566
🏁 Script executed:
#!/bin/bash
# Check test cases for accent handling
find . -name '*StringMatcher*Test*' -o -name '*Test*StringMatcher*' | head -5
Repository: Flow-Launcher/Flow.Launcher
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Search for test files
fd -e cs -p '*test*' | rg -i 'stringmatcher|fuzzymatch' | head -10
Repository: Flow-Launcher/Flow.Launcher
Length of output: 356
🏁 Script executed:
#!/bin/bash
# Get the complete RemoveAccents method
sed -n '240,260p' Flow.Launcher.Infrastructure/StringMatcher.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 1034
🏁 Script executed:
#!/bin/bash
# Search for test files properly
fd --glob '**/*Test*.cs' | rg -i 'stringmatcher|fuzzymatch'
Repository: Flow-Launcher/Flow.Launcher
Length of output: 109
🏁 Script executed:
#!/bin/bash
# Also check if there are unit test files
find . -type f -name '*.cs' | xargs grep -l 'RemoveAccents\|FuzzyMatch' | grep -i test
Repository: Flow-Launcher/Flow.Launcher
Length of output: 111
🏁 Script executed:
#!/bin/bash
# Check the test file for accent/diacritic test cases
rg -n 'RemoveAccents\|accent\|diacritic\|café\|é' Flow.Launcher.Test/FuzzyMatcherTest.cs -i
Repository: Flow-Launcher/Flow.Launcher
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Get the full FuzzyMatcherTest.cs to understand test coverage
wc -l Flow.Launcher.Test/FuzzyMatcherTest.cs
Repository: Flow-Launcher/Flow.Launcher
Length of output: 113
🏁 Script executed:
#!/bin/bash
# Look at the test file to see if accents are tested
sed -n '1,100p' Flow.Launcher.Test/FuzzyMatcherTest.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 4471
🏁 Script executed:
#!/bin/bash
# Search through the entire test file for any accent or diacritic related tests
cat Flow.Launcher.Test/FuzzyMatcherTest.cs | grep -n -i 'accent\|diacritic\|café\|é\|ñ\|à\|ü' | head -20
Repository: Flow-Launcher/Flow.Launcher
Length of output: 53
🏁 Script executed:
#!/bin/bash
# Let's check what's in the second half of the test file
sed -n '200,300p' Flow.Launcher.Test/FuzzyMatcherTest.cs | cat -n
Repository: Flow-Launcher/Flow.Launcher
Length of output: 7418
🏁 Script executed:
#!/bin/bash
# Create a simple test to understand RemoveAccents behavior
python3 << 'EOF'
# Simulate C# NormalizationForm.FormD decomposition and accent removal
import unicodedata

test_strings = [
    "café",       # Regular accent
    "naïve",      # Diaeresis
    "Zürich",     # Umlaut
    "São Paulo",  # Tilde
    "e̊",          # e with combining ring
]

for s in test_strings:
    # FormD = Decomposed
    decomposed = unicodedata.normalize('NFD', s)
    # Remove combining marks (NonSpacingMark category)
    removed = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # FormC = Composed (what the code does)
    recomposed = unicodedata.normalize('NFC', removed)
    print(f"Original: '{s}' (length: {len(s)})")
    print(f"Decomposed: '{decomposed}' (length: {len(decomposed)})")
    print(f"Accents removed: '{removed}' (length: {len(removed)})")
    print(f"Recomposed: '{recomposed}' (length: {len(recomposed)})")
    print(f"Match indices broken? {len(s) != len(recomposed)}")
    print()
EOF
Repository: Flow-Launcher/Flow.Launcher
Length of output: 916
Index mismatch when original string contains combining marks or is in decomposed form.
When RemoveAccents normalizes the string to FormD (decomposed), removes combining marks, then recomposes to FormC, the string length may change—particularly with combining marks like ring above (e̊) that decompose to multiple characters. The indices computed during matching are based on the post-RemoveAccents string, but are used to highlight the original input string. If the original string was in decomposed form or contains combining marks, the indices will be misaligned.
Unlike the translationMapping for alphabet translation (created after accent removal), there is no equivalent mapping to adjust indices from the normalized string back to the original. No test cases currently cover this scenario.
Consider:
- Storing the original stringToCompare before normalization
- Creating an index mapping for accent removal changes (similar to TranslationMapping)
- Applying this mapping to both indexList and acronymMatchData before returning
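The mapping suggested above can be sketched in Python, in the same spirit as the simulation script in this review. This is only an illustration under stated assumptions, not the PR's code: `remove_accents_with_mapping` and `map_indices_back` are hypothetical names, and Python indexes code points whereas C# strings index UTF-16 code units, so a real port would need to account for that difference.

```python
import unicodedata


def remove_accents_with_mapping(text):
    """Strip nonspacing marks and record where each kept character came from.

    Returns (stripped, mapping) where mapping[i] is the index in `text`
    of the character that produced stripped[i].
    """
    stripped = []
    mapping = []
    for original_index, ch in enumerate(text):
        # Decompose one original character at a time so every output
        # character can be traced back to its source position.
        pieces = unicodedata.normalize("NFD", ch)
        kept = "".join(p for p in pieces if unicodedata.category(p) != "Mn")
        # Recompose what is left (keeps e.g. Hangul syllables intact).
        for out_ch in unicodedata.normalize("NFC", kept):
            stripped.append(out_ch)
            mapping.append(original_index)
    return "".join(stripped), mapping


def map_indices_back(indices, mapping):
    """Translate match indices on the stripped string to original indices."""
    return [mapping[i] for i in indices]
```

With a decomposed input such as `"Ca\u0302mera"`, the combining circumflex at index 2 is dropped, so a match on the stripped characters `m`, `e`, `r`, `a` has to be shifted by one position before it is used for highlighting; `map_indices_back` performs that shift.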
@4yinn Do you think we need to handle this?
From my perspective as a Flow Launcher user, I think this feature is quite necessary. Many times when I search for programs that contain accents, such as “Câmera”, I tend to type the word without accents, like “camera”, because that’s how I usually type when searching in Windows.
Because of this, I often have difficulty finding applications in Flow when their names contain accents. The same issue was mentioned in the GitHub issue opened by another Brazilian user, which is why I decided to implement this improvement.
Sorry for the late response.
I'm sorry, it seems you misunderstood me. What I meant was that from AI's review, this removal may change the length of the string which can cause some issues like incorrect highlights.
Sorry for the late reply. hmm ok. Do u think this implementation isn’t necessary, or does it just need further optimization?
@4yinn I'm not sure whether this removal actually changes the string. Since characters like these do not exist in my language, could you confirm this? If it does, improvements are needed here: we need to map the indices back to the original string and handle the highlighting accordingly.
Yeah, my language is pretty boring when it comes to accents. Sorry for the late reply again — I was working. I’ll check this now and look for improvements.
Thanks for your reply!
2 issues found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="Flow.Launcher.Infrastructure/StringMatcher.cs">
<violation number="1" location="Flow.Launcher.Infrastructure/StringMatcher.cs:72">
P1: `query` is not revalidated after accent stripping, so mark-only Unicode input can become empty and crash at `querySubstrings[0]`.</violation>
<violation number="2" location="Flow.Launcher.Infrastructure/StringMatcher.cs:73">
P2: Match indices are computed after accent-stripping but never mapped back to original string positions, causing incorrect highlight offsets for decomposed Unicode text.</violation>
</file>
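The first violation can be illustrated with a small Python sketch. It is an assumption-laden model of the behaviour, not the C# code under review: `strip_marks` and `first_query_word` are hypothetical helpers standing in for RemoveAccents and the access to the first query substring.

```python
import unicodedata


def strip_marks(text):
    """Remove nonspacing combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


def first_query_word(query):
    """Return the first word of the stripped query, or None if nothing is left."""
    stripped = strip_marks(query).strip()
    # Revalidate AFTER stripping: a query made only of combining marks
    # passes a pre-strip emptiness check but is empty at this point.
    if not stripped:
        return None
    return stripped.split(" ")[0]
```

The point of the sketch is the order of operations: the emptiness check has to run on the stripped query, because a check made before stripping cannot see that the input will collapse to an empty string.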
Jack251970 left a comment
Thanks for your contribution! However, according to the AI review, this removal may change the length of the string, which can cause issues such as incorrect highlights.
Please avoid unrelated code formatting changes so that we can trace changes in the future.
jjw24 left a comment
Thank you for your PR.
StringMatcher is performance-critical, so we must minimize additional overhead. Please address the following:
- Consolidate Loops: The current implementation performs an initial loop to identify and remove Unicode characters, followed by a second loop for the main matching logic. Please merge it into the main loop to avoid redundant iteration.
- Conditional Unicode Processing: We should make Unicode removal optional—for instance, by adding a toggle for non-English languages. This prevents unnecessary processing when the feature isn't required.
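One possible shape for both requests, sketched in Python rather than C#: fold each character inside the matching loop instead of preprocessing the whole string, and gate the folding behind a flag. This is a plain in-order subsequence match, not Flow Launcher's scoring algorithm, and `fold_char`, `subsequence_match` and `ignore_diacritics` are hypothetical names chosen for illustration.

```python
import unicodedata


def fold_char(ch):
    """Return `ch` without nonspacing marks ('' if `ch` is itself a bare mark)."""
    if ch.isascii():
        return ch  # fast path: ASCII characters carry no diacritics
    return "".join(p for p in unicodedata.normalize("NFD", ch)
                   if unicodedata.category(p) != "Mn")


def subsequence_match(query, target, ignore_diacritics=False):
    """Return the indices in `target` that match `query` in order, or None."""
    fold = fold_char if ignore_diacritics else (lambda ch: ch)
    # The query is short, so folding it up front is cheap; bare marks drop out.
    needle = [f for f in (fold(c).lower() for c in query) if f]
    if not needle:
        return None
    indices = []
    pos = 0
    # Single pass over the ORIGINAL target: each character is folded only
    # when it is compared, so the recorded indices need no remapping.
    for i, ch in enumerate(target):
        if fold(ch).lower() == needle[pos]:
            indices.append(i)
            pos += 1
            if pos == len(needle):
                return indices
    return None
```

Because the loop walks the original target, the returned indices already refer to original positions, which would also sidestep the highlight-offset concern raised earlier in the review. When the flag is off, no normalization work is done at all.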
Yes, I can see the problem. I'll think about a way to implement it with better optimizations.
Description
Added a preprocessing step to FuzzySearch to normalize Unicode characters and remove diacritics before matching.
This allows searches to be accent-insensitive, improving usability. For example, searching for camera will also match câmera.
Related Issue
Closes #4149
Summary by cubic
Makes fuzzy search accent-insensitive by normalizing and stripping Unicode diacritics before matching. Adds null/empty input checks to avoid unnecessary processing and false matches (relates to #4149).
- FuzzyMatch now trims and removes diacritics from both query and target before translation/matching; minor formatting tweaks only.
- RemoveAccents(string) using Unicode normalization (FormD) and filtering NonSpacingMark, then re-composing (FormC); early return when query is whitespace or stringToCompare is empty after preprocessing.
- StringBuilder; early check avoids extra work; no long-lived impact.
Written for commit f905530. Summary will update on new commits.