Description
While the spec doesn't seem to say anything about supported formats, in practice, inputting HTML seems to work just fine (and surprisingly fine even) for translations:
const translator = await Translator.create({
  sourceLanguage: 'en',
  targetLanguage: 'es',
});
await translator.translate('<p>Hello, or should I say <strong>good morning</strong>?</p>');
// '<p>Hola, o debería decir <strong>Buenos días</strong>?</p>'
However, the same can't be said for language detection:
const languageDetector = await LanguageDetector.create();
await languageDetector.detect('<strong>Guten Morgen</strong>');
/*
[
  {confidence: 0.4321225881576538, detectedLanguage: 'en'},
  {confidence: 0.20436188578605652, detectedLanguage: 'de'}, // 👈 Correct
  …
]
*/
As a test, stripping the HTML from the input string results in the language being correctly detected as German.
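For reference, the stripping workaround could look roughly like the sketch below. The helper names `htmlToPlainText` and `detectLanguageOfHtml` are made up for illustration and are not part of any API. The sketch assumes the detector returns its results ordered by descending confidence, as in the output above, and the non-DOM fallback is a naive tag remover that neither decodes entities nor acts as a sanitizer.

```javascript
// Hypothetical helper: reduce an HTML string to its text content.
// Uses DOMParser where it exists (browsers); otherwise falls back to a
// naive tag-removing regex that does not decode entities.
function htmlToPlainText(html) {
  const text = typeof DOMParser !== 'undefined'
    ? new DOMParser().parseFromString(html, 'text/html').body.textContent ?? ''
    : html.replace(/<[^>]*>/g, '');
  // Collapse whitespace runs so leftover line breaks don't end up in the input.
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: run detection on the stripped text and return the
// first result (assumed to be the most confident one).
async function detectLanguageOfHtml(detector, html) {
  const results = await detector.detect(htmlToPlainText(html));
  return results[0];
}
```

With a detector obtained from `LanguageDetector.create()`, calling `detectLanguageOfHtml(languageDetector, '<strong>Guten Morgen</strong>')` would pass only `Guten Morgen` to `detect()`.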
Do we need something like SummarizerFormat, but for the Translator and Language Detector APIs? It could, for example, set the expectation that marked-up words like <strong>good morning</strong> remain marked up in the output (<strong>Buenos días</strong>), and it would spare developers an innerText dance to strip HTML formatting before language detection.
The supported formats would ideally be the same as for the Writing Assistance APIs, that is, plain text and Markdown.
As a developer, I'd also like direct HTML support, as this avoids a to-Markdown dance, which may lose details such as classes that were in the original HTML.