First off, thank you for ftfy - it's an incredibly useful library. I wanted to share an idea that came up while working on a dataset with mojibake corruption.
ftfy does an excellent job of recovering text when the encoding chain can be reversed. However, when bytes are truly lost (i.e., replaced with U+FFFD), ftfy correctly gives up, because there is no deterministic way to recover what's gone. But I've noticed that LLMs can often infer the original text from linguistic context, even when the bytes themselves are unrecoverable. Here's an example:
```python
>>> fix_text("ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7")
'ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7'
```
An LLM, given the same string, can supply the missing context: it brings together script recognition, lexical knowledge, and geography to fill in all three separate replacement characters (here, plausibly ΤΑΞΙΔΙ ΞΑΝΘΗ, i.e. a trip to Xanthi).
I'm not suggesting ftfy should bundle an LLM or make API calls, as I suppose that would be out of scope and change the library's nature entirely.
But what about an optional callback for users who want to handle unrecoverable cases themselves? Maybe something like this:
```python
def my_llm_inference(text: str, explanation: str) -> str | None:
    """User-provided function to handle unrecoverable text.

    Args:
        text: The partially-fixed text with remaining issues
        explanation: ftfy's explanation of what it couldn't fix

    Returns:
        Inferred text, or None to keep ftfy's output as-is
    """
    # User implements their own LLM call, dictionary lookup,
    # manual review queue, etc.
    ...

fix_text(s, unrecoverable_callback=my_llm_inference)
```

I completely understand if this is out of scope. ftfy's strength is that it's deterministic and reliable. Adding inference (even optionally) changes that guarantee. But given your NLP background, I thought you might find the intersection interesting.
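For what it's worth, users can approximate this today without any change to ftfy, by wrapping the fixer and checking for leftover U+FFFD characters. Here is a minimal sketch; `fix_with_fallback` and its parameters are hypothetical names of mine, not part of ftfy's API, and the fixer is injected so any fixing function (e.g. `ftfy.fix_text`) can be passed in:

```python
from typing import Callable, Optional

REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character

def fix_with_fallback(
    text: str,
    callback: Callable[[str, str], Optional[str]],
    fixer: Callable[[str], str],
) -> str:
    """Run a fixer (e.g. ftfy.fix_text), then invoke `callback` only
    when the result still contains U+FFFD, i.e. bytes are truly gone.

    `callback` receives the partially-fixed text plus a short note and
    returns inferred text, or None to keep the fixer's output as-is.
    """
    fixed = fixer(text)
    if REPLACEMENT in fixed:
        inferred = callback(fixed, "text still contains U+FFFD after fixing")
        if inferred is not None:
            return inferred
    return fixed
```

With ftfy installed, you would call it as `fix_with_fallback(s, my_llm_inference, ftfy.fix_text)`. The main argument for building the hook into `fix_text` itself, rather than a wrapper like this, is that ftfy could pass along its internal explanation of what it tried.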