First off, thank you for ftfy - it's an incredibly useful library. I wanted to share an idea that came up while working on a dataset with mojibake corruption.
ftfy does an excellent job of recovering text when the encoding chain can be reversed. However, when bytes are truly lost (i.e., replaced with U+FFFD), ftfy correctly gives up, because there is no deterministic way to recover what's gone. But I've noticed that LLMs can often infer the original text from linguistic context, even when the bytes themselves are unrecoverable. Here's an example:
```python
>>> fix_text("ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7")
'ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7'
```
An LLM, given the same string, can supply the missing context: it brings together script recognition, lexical knowledge, and geography to fill in all three separate replacement characters (here, plausibly ΤΑΞΙΔΙ ΞΑΝΘΗ, i.e. a trip to Xanthi).
I'm not suggesting ftfy should bundle an LLM or make API calls, as I suppose that would be out of scope and change the library's nature entirely.
But what about an optional callback for users who want to handle unrecoverable cases themselves? Maybe something like this:
```python
def my_llm_inference(text: str, explanation: str) -> str | None:
    """User-provided function to handle unrecoverable text.

    Args:
        text: The partially-fixed text with remaining issues
        explanation: ftfy's explanation of what it couldn't fix

    Returns:
        Inferred text, or None to keep ftfy's output as-is
    """
    # User implements their own LLM call, dictionary lookup,
    # manual review queue, etc.
    ...

fix_text(s, unrecoverable_callback=my_llm_inference)
```

I completely understand if this is out of scope. ftfy's strength is that it's deterministic and reliable. Adding inference (even optionally) changes that guarantee. But given your NLP background, I thought you might find the intersection interesting.
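For what it's worth, users can approximate this today without any change to ftfy, by wrapping the fixer and checking for leftover U+FFFD characters. Here is a minimal sketch; `fix_with_fallback` and its parameters are hypothetical names of mine, not part of ftfy's API, and the fixer is injected so any fixing function (e.g. `ftfy.fix_text`) can be passed in:

```python
from typing import Callable, Optional

REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character

def fix_with_fallback(
    text: str,
    callback: Callable[[str, str], Optional[str]],
    fixer: Callable[[str], str],
) -> str:
    """Run a fixer (e.g. ftfy.fix_text), then invoke `callback` only
    when the result still contains U+FFFD, i.e. bytes are truly gone.

    `callback` receives the partially-fixed text plus a short note and
    returns inferred text, or None to keep the fixer's output as-is.
    """
    fixed = fixer(text)
    if REPLACEMENT in fixed:
        inferred = callback(fixed, "text still contains U+FFFD after fixing")
        if inferred is not None:
            return inferred
    return fixed
```

With ftfy installed, you would call it as `fix_with_fallback(s, my_llm_inference, ftfy.fix_text)`. The main argument for building the hook into `fix_text` itself, rather than a wrapper like this, is that ftfy could pass along its internal explanation of what it tried.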