Feature idea: Optional LLM callback for contextual inference of unrecoverable text #225

@jeiros

First off, thank you for ftfy - it's an incredibly useful library. I wanted to share an idea that came up while working on a dataset with mojibake corruption.

ftfy does an excellent job of recovering text when the encoding chain can be reversed. However, when bytes are truly lost and only U+FFFD (the replacement character) remains, ftfy correctly gives up, because there's no deterministic way to recover what's gone. But I've noticed that LLMs can often infer the original text from linguistic context, even when the bytes are unrecoverable. Here's an example:

>>> from ftfy import fix_text
>>> fix_text("ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7")
'ΤΑΞΙ�Ι ΞΑ�Θ� 1-4/7'
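To illustrate why such cases have no deterministic fix, here is a minimal sketch using only Python's built-in codecs. It assumes the loss happened through a truncated UTF-8 sequence, which is just one of several ways U+FFFD can appear and may not be what happened to this dataset. The helper name `decode_without_last_byte` is made up for the illustration.

```python
def decode_without_last_byte(text: str) -> str:
    """Encode to UTF-8, drop the final byte, and decode leniently.

    Simulates one way a byte can be lost. The incomplete trailing
    sequence is decoded as U+FFFD because errors="replace" is used.
    """
    damaged = text.encode("utf-8")[:-1]
    return damaged.decode("utf-8", errors="replace")


# Two different originals: they differ only in the last letter.
a = decode_without_last_byte("ΤΑΞΙ")
b = decode_without_last_byte("ΤΑΞΗ")

# Both collapse to the same damaged string, so no function of the
# damaged string alone can tell which original it came from.
print(a == b)
print(a.endswith("\ufffd"))
```

Since distinct inputs map to the same output, recovering the original needs information from outside the string, such as linguistic context.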

An LLM, by contrast, can use the surrounding context to propose a reconstruction:

[Image: screenshot of the LLM's response]

The LLM brings together script recognition, lexical knowledge and geography to fill in three separate replacement characters.

I'm not suggesting ftfy should bundle an LLM or make API calls, as I suppose that would be out of scope and change the library's nature entirely.
But what about an optional callback for users who want to handle unrecoverable cases themselves? Maybe something like this:

def my_llm_inference(text: str, explanation: str) -> str | None:
    """User-provided function to handle unrecoverable text.
    
    Args:
        text: The partially-fixed text with remaining issues
        explanation: ftfy's explanation of what it couldn't fix
    
    Returns:
        Inferred text, or None to keep ftfy's output as-is
    """
    # User implements their own LLM call, dictionary lookup, 
    # manual review queue, etc.
    ...

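# Proposed usage: `unrecoverable_callback` is the suggested new parameter,
# and `s` is the input string.
fix_text(s, unrecoverable_callback=my_llm_inference)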
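For comparison, something similar can be approximated today with a thin wrapper outside the library. This is a minimal sketch, not a proposed implementation: `fix_with_fallback`, `lookup_callback`, and `KNOWN_REPAIRS` are hypothetical names, the explanation string is generated by the wrapper rather than by ftfy, and a small dictionary stands in for the LLM call. The fixer is passed in as a parameter so the sketch runs without ftfy installed.

```python
from typing import Callable, Optional

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when bytes could not be decoded


def fix_with_fallback(
    text: str,
    fixer: Callable[[str], str],
    callback: Callable[[str, str], Optional[str]],
) -> str:
    """Run `fixer`, then offer any text that still has U+FFFD to `callback`."""
    fixed = fixer(text)
    remaining = fixed.count(REPLACEMENT_CHAR)
    if remaining == 0:
        # Nothing unrecoverable left, so the callback is never invoked.
        return fixed
    explanation = f"{remaining} replacement character(s) (U+FFFD) remain"
    inferred = callback(fixed, explanation)
    # None means "keep the fixer's output as-is".
    return fixed if inferred is None else inferred


# Hypothetical stand-in for an LLM call: a fixed lookup table.
KNOWN_REPAIRS = {"caf\ufffd": "café"}


def lookup_callback(text: str, explanation: str) -> Optional[str]:
    return KNOWN_REPAIRS.get(text)


# `str` acts as an identity fixer here; with ftfy installed,
# `ftfy.fix_text` could be passed instead.
print(fix_with_fallback("caf\ufffd", str, lookup_callback))
```

Keeping the inference step in user code like this leaves the fixer itself deterministic. The trade-off is that the callback only sees the final string, not ftfy's internal reasoning about what it could not fix.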

I completely understand if this is out of scope. ftfy's strength is that it's deterministic and reliable. Adding inference (even optionally) changes that guarantee. But given your NLP background, I thought you might find the intersection interesting.
