
fix: do not speak ssml markup #741

Draft
rwaskiewicz wants to merge 5 commits into main from rw/ssml-markup-readout
@rwaskiewicz rwaskiewicz commented May 29, 2026

Asana task: Bug: PA Messages with SSML readout markup

Description

Update the server so that it conditionally skips HTML-escaping SSML content.

In cases where we have PA message templates with predefined audio text
to send to Polly, we store the audio text as a single string. This
string may include SSML, e.g.

"""
To make space for other passengers and speed up boarding, please take off your backpack before entering the train and hold it at your side.

<lang xml:lang=\"es-US\"> Para dejar espacio para otros pasajeros y acelerar el embarque, por favor quítese la mochila y manténgala a su lado. </lang>
"""

The latter half of that message is to be spoken in Spanish by Polly.
However, we currently HTML-escape the entire string, which transforms
the XML tags from <lang> to &lt;lang&gt;. This causes Polly to speak
"lang" as if it were a word rather than treating it as an SSML
instruction.
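
As a rough illustration of the problem and the conditional bypass, here is a minimal TypeScript sketch. The actual server code is Elixir, and the names `escapeHtml` and `prepareAudioText` are hypothetical, not taken from this PR; only the `hasSsml` flag corresponds to the boolean introduced here.

```typescript
// Hypothetical helper: a typical HTML escape over the whole string.
// "&" is replaced first so later entities are not double-escaped.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical helper: skip escaping only when the caller declares that
// the text is trusted, predefined SSML.
function prepareAudioText(text: string, hasSsml: boolean): string {
  return hasSsml ? text : escapeHtml(text);
}

const template = '<lang xml:lang="es-US"> Hola </lang>';

// Escaped: Polly receives literal text and reads the tag name aloud.
console.log(prepareAudioText(template, false));
// &lt;lang xml:lang=&quot;es-US&quot;&gt; Hola &lt;/lang&gt;

// Unescaped: Polly receives the tag and can treat it as SSML.
console.log(prepareAudioText(template, true));
// <lang xml:lang="es-US"> Hola </lang>
```

Free-form text typed by a user would still go through the escaping path in this sketch; only predefined templates would set the flag.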

We skip escaping rather than introducing an XML parser and escaping only
the text nodes in the parsed document, because we assume that predefined
audio text such as this is, one, valid XML and, two, immutable by the
client. This avoids having to handle edge cases in XML parsing.

We also chose this over altering the data model to split audio text into
separate strings per language, to minimize the scope of this effort.

Add a prosody rate of 90% and the drc effect to spoken text to bring it
into parity with RTS.
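
For reference, the resulting SSML envelope can be sketched in TypeScript as follows. The markup and the "Matthew" voice are taken from the diff in this PR; the function names `wrapForPolly` and `buildTtsPayload` are hypothetical, and the real implementation lives in the Elixir server.

```typescript
// Hypothetical TypeScript mirror of the server-side wrapper; the function
// names are illustrative and do not appear in the PR.
function wrapForPolly(body: string): string {
  return (
    '<speak><amazon:effect name="drc"><prosody rate="90%">' +
    body +
    "</prosody></amazon:effect></speak>"
  );
}

// Builds a JSON request body with the same two fields shown in the diff.
function buildTtsPayload(body: string): string {
  return JSON.stringify({ text: wrapForPolly(body), voice_id: "Matthew" });
}

console.log(wrapForPolly("Please stand clear of the doors."));
```

Because the wrapper only concatenates strings, any SSML inside `body` (such as a `<lang>` element) ends up nested inside the `<prosody>` element.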

  • For features with a design/UX component, deployed branch to dev-green and let product know it's ready for review.

Add a simple test to get a baseline of sanitization behavior before
changing the client
Fetches an audio file from Watts given a string.
"""
@callback fetch_tts(String.t()) :: {:ok, binary()} | :error
@callback fetch_tts(String.t(), boolean) :: {:ok, binary()} | :error

Asking for opinions here: boolean parameters feel like a code smell. I'm unsure whether that's Uncle Bob whispering in my ear, the ghost of Crystal Reports past, or something else. Would it be better (i.e. more idiomatic) to have an opts arg and match on a has_ssml field in there?

const [phoneticText, setPhoneticText] = useState(
defaultValues?.audio_text ?? "",
);
const [phoneticTextHasSsml, setPhoneticTextHasSsml] = useState<boolean>(

I think that between here and MainForm I've covered the various cases where this state is required, but I'd like someone with a little more experience to take a second look and verify that through the UI.

Jason.encode!(%{
text:
~s(<speak><amazon:effect name="drc"><prosody rate="90%">#{text}</prosody></amazon:effect></speak>),
voice_id: "Matthew"
@rwaskiewicz rwaskiewicz May 29, 2026

One callout: we hardcode "Matthew" here. However, for Spanish in RTS, we use "Mia". Any thoughts or concerns about this divergence?
