Input formats for translation and language detection #60

@tomayac

While the spec doesn't seem to say anything about supported input formats, in practice, HTML input seems to work surprisingly well for translations:

const translator = await Translator.create({
  sourceLanguage: 'en',
  targetLanguage: 'es',
});

await translator.translate('<p>Hello, or should I say <strong>good morning</strong>?</p>');
// '<p>Hola, o debería decir <strong>Buenos días</strong>?</p>'

However, the same can't be said for language detection:

const languageDetector = await LanguageDetector.create();
await languageDetector.detect('<strong>Guten Morgen</strong>');
/*
[
  {confidence: 0.4321225881576538, detectedLanguage: 'en'},
  {confidence: 0.20436188578605652, detectedLanguage: 'de'}, // 👈 Correct

]
*/

As a test, stripping the HTML from the input string results in the language being correctly detected as German.
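For illustration, the stripping workaround could look roughly like the sketch below. The helper names `htmlToText` and `detectFromHtml` are made up for this example and are not part of any API. The sketch assumes a browser with `DOMParser`; the regex fallback is a naive stand-in for non-DOM environments only.

```javascript
// Hypothetical helper (not part of any API): reduce an HTML string to plain text.
function htmlToText(html) {
  if (typeof DOMParser !== 'undefined') {
    // Browser path: let the HTML parser handle tags and entities.
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent ?? '';
  }
  // Naive fallback for environments without a DOM: drops anything that
  // looks like a tag and does not decode entities such as &amp;.
  return html.replace(/<[^>]*>/g, '');
}

// Hypothetical wrapper: `detector` is assumed to be any object with an
// async detect(string) method, such as a LanguageDetector instance.
async function detectFromHtml(detector, html) {
  return detector.detect(htmlToText(html));
}
```

In a browser that ships the API, this would be called as `await detectFromHtml(await LanguageDetector.create(), '<strong>Guten Morgen</strong>')`, which hands the detector the string `Guten Morgen` instead of the markup.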

Do we need something like SummarizerFormat, but for the Translator and Language Detector APIs? It could, for example, set the expectation that marked-up words like <strong>good morning</strong> stay marked up in the output (<strong>Buenos días</strong>), and it would spare the developer an innerText dance to strip HTML formatting before language detection.

The to-be-supported formats ideally would be the same as for the Writing Assistance APIs, that is, plain text and Markdown.

As a developer, I'd also like direct HTML support, as this avoids a to-Markdown dance, which may lose styling information such as classes that were in the original HTML.
