fix: do not speak ssml markup #741
Add a simple test to get a baseline of sanitization behavior before changing the client
Update the server to conditionally skip HTML escaping for SSML content.

In cases where we have PA message templates with predefined audio text to send to Polly, we store the audio text as a single string. This string may include SSML, e.g.

```
"""
To make space for other passengers and speed up boarding, please take off your backpack before entering the train and hold it at your side.
<lang xml:lang=\"es-US\">
Para dejar espacio para otros pasajeros y acelerar el embarque, por favor quítese la mochila y manténgala a su lado.
</lang>
"""
```

Where the latter half of that message is to be spoken in Spanish by Polly. However, we currently HTML escape the entire string, causing the XML tags to be transformed from `<lang>` to `&lt;lang&gt;`. This causes Polly to speak "lang" as if it were a word, rather than treating it as an SSML instruction.

This is done instead of introducing an XML parser and escaping only the text nodes in the parsed document, as we assume that predefined audio text such as this is one, valid XML and two, immutable by the client. This avoids having to handle edge cases with respect to XML parsing.

This is also done instead of altering the data model to split audio text into different strings per language, to minimize the scope of this effort.
Add prosody rate of 90% and drc effect to spoken text to bring spoken text into parity with RTS
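For reference, the two server-side changes described in the commits above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the module name `SsmlSketch` and the functions `escape/1` and `to_ssml/2` are hypothetical, and the escaping is hand-rolled so the snippet has no dependencies. Only the `<speak>` wrapper string is taken from the diff.

```elixir
defmodule SsmlSketch do
  # Hypothetical helper, not part of the PR: escapes the five XML special
  # characters. "&" is replaced first so that the entities produced by the
  # later replacements are not escaped a second time.
  def escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
    |> String.replace("\"", "&quot;")
    |> String.replace("'", "&apos;")
  end

  # When has_ssml is true the text is assumed to be predefined, valid SSML
  # and is passed through untouched; otherwise it is escaped as before.
  def to_ssml(text, has_ssml) when is_binary(text) and is_boolean(has_ssml) do
    body = if has_ssml, do: text, else: escape(text)

    # Wrapper copied from the diff: drc effect plus a 90% prosody rate.
    ~s(<speak><amazon:effect name="drc"><prosody rate="90%">#{body}</prosody></amazon:effect></speak>)
  end
end
```

With this shape, free-form text typed by a user still has `<` turned into `&lt;`, while a template containing `<lang xml:lang="es-US">` reaches Polly with its tags intact.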
| Fetches an audio file from Watts given a string. | ||
| """ | ||
| @callback fetch_tts(String.t()) :: {:ok, binary()} | :error | ||
| @callback fetch_tts(String.t(), boolean) :: {:ok, binary()} | :error |
Asking for opinions here - boolean parameters here feel like a code smell. Unsure if that's Uncle Bob whispering in my ear, the ghost of Crystal Reports past, or something else. Would it be better (i.e. more idiomatic) to have an opts arg and match on a has_ssml field in there?
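To make the alternative concrete, here is a rough sketch of what an opts-based signature might look like. Everything in it is an assumption for discussion: the module name `TtsOptsSketch`, the `:has_ssml` key, and the tagged return values that stand in for the real request to Watts.

```elixir
defmodule TtsOptsSketch do
  # Hypothetical option type; only :has_ssml is modeled here.
  @type opt :: {:has_ssml, boolean()}

  @spec fetch_tts(String.t(), [opt()]) :: {:ok, binary()} | :error
  def fetch_tts(text, opts \\ []) when is_binary(text) and is_list(opts) do
    # Keyword.get/3 falls back to false when the caller omits the option,
    # so existing one-argument call sites keep the escaping behavior.
    case Keyword.get(opts, :has_ssml, false) do
      # Stand-ins for the real HTTP call: the prefix just records which
      # branch was taken.
      true -> {:ok, "ssml:" <> text}
      false -> {:ok, "escaped:" <> text}
    end
  end
end
```

A call site then reads `fetch_tts(text, has_ssml: true)`, which names the flag where a bare `true` would not, and leaves room for further options without another arity change.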
| const [phoneticText, setPhoneticText] = useState( | ||
| defaultValues?.audio_text ?? "", | ||
| ); | ||
| const [phoneticTextHasSsml, setPhoneticTextHasSsml] = useState<boolean>( |
I think between here and MainForm I've covered the various cases where this state is required, but I'm asking someone with a little more experience to take a second look and evaluate that through the UI to double check me here.
| Jason.encode!(%{ | ||
| text: | ||
| ~s(<speak><amazon:effect name="drc"><prosody rate="90%">#{text}</prosody></amazon:effect></speak>), | ||
| voice_id: "Matthew" |
One call out here - we hardcode "Matthew" here. However, for Spanish in RTS, we use "Mia". Any thoughts/concerns on this divergence?
Asana task: Bug: PA Messages with SSML readout markup
Description
Update the server to conditionally skip HTML escaping for SSML content.
In cases where we have PA message templates with predefined audio text
to send to Polly, we store the audio text as a single string. This
string may include SSML, e.g. a `<lang xml:lang="es-US">` element
wrapping a Spanish translation of the English message, so that the
latter half of that message is spoken in Spanish by Polly.
However, we currently HTML escape the entire string, causing the
XML tags to be transformed from `<lang>` to `&lt;lang&gt;`. This causes
Polly to speak "lang" as if it were a word, rather than treating it as
an SSML instruction.
This is done instead of introducing an XML parser and escaping only the
text nodes in the parsed document, as we assume that predefined audio
text such as this is one, valid XML and two, immutable by the client.
This avoids having to handle edge cases with respect to XML parsing.
This is also done instead of altering the data model to split audio text
into different strings per language, to minimize the scope of this effort.
Add prosody rate of 90% and drc effect to spoken text to bring spoken
text into parity with RTS.