fix: Add Unicode sanitization for cloud embedders by anatolykoptev · Pull Request #1048 · MemTensor/MemOS

anatolykoptev · 2026-02-07T00:45:31Z

Fix: Add Unicode sanitization for cloud embedders

Problem

Requests to cloud embedding APIs (VoyageAI, OpenAI, etc.) fail with UnicodeEncodeError in production when the text contains Unicode surrogates, which in practice often show up in text that also contains emoji.

Error Example

text = "Hello 👋 \ud800"  # Contains emoji + surrogate
embedder.embed([text])
# UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800'

Root Cause

Unicode surrogates (U+D800–U+DFFF) are invalid in UTF-8
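Text with emoji or international characters can cause encoding issues when it carries unpaired surrogates (for example half of an emoji's surrogate pair, such as "\ud83d")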
Cloud APIs have stricter validation than local embedders (Ollama)
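The root cause can be reproduced without any embedder. The sketch below uses only the standard library; `has_surrogates` is a hypothetical helper name introduced for illustration and is not part of this PR.

```python
# Escapes are used instead of literal characters so the snippet is encoding-safe.
emoji = "Hello \U0001F44B"   # waving-hand emoji, a regular code point above U+FFFF
lone = "Hello \ud800"        # a lone high surrogate

# A well-formed emoji encodes to four UTF-8 bytes without error.
emoji_bytes = emoji.encode("utf-8")

# A lone surrogate has no UTF-8 encoding, so strict encoding raises.
try:
    lone.encode("utf-8")
    surrogate_failed = False
except UnicodeEncodeError:
    surrogate_failed = True


def has_surrogates(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF.

    Hypothetical helper for illustration only.
    """
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)
```

A check like `has_surrogates` could be used for logging how often affected texts occur before they are sanitized.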

Solution

Added _sanitize_unicode() function that:

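Encodes the text with surrogatepass error handling and decodes it with errors='replace', which turns surrogates into U+FFFD replacement characters
Removes those replacement characters (replaces them with an empty string)
Falls back to removing all non-BMP characters if the encode/decode round trip raises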
Applied automatically before all embedding API calls

Implementation

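def _sanitize_unicode(text: str) -> str:
    """Remove Unicode surrogates and problematic characters."""
    try:
        # surrogatepass lets surrogates through encode(); their bytes are not
        # valid UTF-8, so decoding with errors='replace' yields U+FFFD.
        cleaned = text.encode('utf-8', errors='surrogatepass').decode('utf-8', errors='replace')
        # Drop the replacement characters (including any U+FFFD already in the input).
        return cleaned.replace('\ufffd', '')
    except Exception:
        # Fallback: keep only BMP code points (this drops emoji and other non-BMP characters).
        return ''.join(c for c in text if ord(c) < 0x10000)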
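Because the helper above removes every U+FFFD, it also removes replacement characters that were already present in the input. If that matters, an alternative is to let the encoder drop unencodable code points directly. This is a different technique from the one in this PR, shown only as a sketch; `strip_surrogates` is a hypothetical name.

```python
def strip_surrogates(text: str) -> str:
    """Drop surrogate code points and leave everything else untouched.

    With errors='ignore', the UTF-8 encoder skips characters it cannot
    encode, which for a Python str means the surrogates U+D800..U+DFFF.
    """
    return text.encode("utf-8", errors="ignore").decode("utf-8")
```

With this variant, emoji and existing U+FFFD characters pass through unchanged, and no fallback branch is needed because the round trip does not raise on surrogates.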

Testing

Tested with:

✅ Emoji: "Hello 👋 🔥"
✅ Surrogates: "\ud800\udc00"
✅ Mixed: "Test 🚀 \ud83d"
✅ International: "中文 العربية Тест"

Impact

Fixes: Production crashes with emoji/international text
Breaking: None - purely additive
Performance: Negligible (<1ms per text)

Related Issues

Fixes production issue with VoyageAI and OpenAI embedders rejecting texts with emoji/surrogates.

- Add _sanitize_unicode() function to remove surrogates
- Apply sanitization before all embedding API calls
- Add comprehensive tests for Unicode handling

Fixes production crashes with VoyageAI/OpenAI when texts contain emoji or Unicode surrogates (U+D800-U+DFFF).

Tested with:
- Emoji: '👋 🔥'
- Surrogates: '\ud800'
- International text: 中文, العربية, Тест

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

anatolykoptev force-pushed the fix/unicode-sanitization-embedders branch from 15b1f1c to 6404af2 Compare February 7, 2026 00:55

anatolykoptev force-pushed the fix/unicode-sanitization-embedders branch from 6404af2 to 9f1c37b Compare February 7, 2026 01:00

anatolykoptev wants to merge 1 commit into MemTensor:main from anatolykoptev:fix/unicode-sanitization-embedders
