Input formats for translation and language detection #60

The spec doesn't seem to say anything about supported input formats. In practice, HTML input seems to work fine (surprisingly so) for translation:

```js
const translator = await Translator.create({
  sourceLanguage: 'en',
  targetLanguage: 'es',
});

await translator.translate('Hello, or should I say good morning?');
// 'Hola, o debería decir Buenos días?'
```

However, the same can't be said for language detection:

```js
const languageDetector = await LanguageDetector.create();
await languageDetector.detect('Guten Morgen');
/*
[
 {confidence: 0.4321225881576538, detectedLanguage: 'en'},
 {confidence: 0.20436188578605652, detectedLanguage: 'de'}, // 👈 Correct
 …
]
*/
```

As a test, stripping the HTML from the input string correctly results in German being detected as the language.
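For reference, the stripping step could look roughly like the sketch below. The names `htmlToPlainText` and `detectFromHtml` are hypothetical helpers, not part of either API, and the regex fallback is a naive stand-in for environments without a DOM rather than a real HTML parser.

```js
// Hypothetical helper: reduce an HTML string to its text content.
function htmlToPlainText(html) {
  if (typeof DOMParser !== 'undefined') {
    // Browser path: let the HTML parser handle entities and nesting.
    // Note: textContent also includes the contents of <script>/<style>.
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent.replace(/\s+/g, ' ').trim();
  }
  // Fallback without a DOM: naive tag strip (assumption: simple markup only,
  // entities are left undecoded).
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical wrapper: `detector` is any object with an async detect(text)
// method, for example the result of `await LanguageDetector.create()`.
async function detectFromHtml(detector, html) {
  return detector.detect(htmlToPlainText(html));
}
```

In a page, this would be called as `await detectFromHtml(await LanguageDetector.create(), someHtmlString)`. It works, but it is exactly the kind of extra step a format option could make unnecessary.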

Do we need something like [`SummarizerFormat`](https://webmachinelearning.github.io/writing-assistance-apis/#enumdef-summarizerformat), but for the Translator and Language Detector APIs? This could, for example, set the expectation that corresponding words like `good morning` are also contained in the output (`Buenos días`), and it would make sure the developer doesn't need to do an `innerText` dance to strip HTML formatting for language detection.

Ideally, the to-be-supported formats would be the same as for the Writing Assistance APIs, that is, plain text and Markdown.

As a developer, I'd also wish for direct HTML support, since it avoids a to-Markdown dance that may lose styling hooks like `class`es that were in the original HTML.
