feat: expressiveness mode, stateless Instructions, structured LLM output #5635
theomonnom wants to merge 23 commits into main
Conversation
- Add expressiveness flag (Agent + AgentSession) that auto-injects TTS markup instructions and speaker context into LLM system messages
- Rework Instructions from str subclass to stateless class with common/audio/text fields. No Pydantic dependency, no runtime state.
- Add AgentInstructions with expressiveness templates; WorkflowInstructions replaces InstructionParts
- Add TTS Markup inner class (llm_instructions + to_text) with shared _provider_format.py for Cartesia/ElevenLabs
- Add RecognizeStream.context + SpeakerContext protocol for STT metadata
- Privatize AudioRecognition, expose only stt_context
- Add llm_output_format class-level attribute for structured LLM output with streaming JSON partial parsing
- Add llm.Response annotation, ChatMessage.llm_output field
- Validate all llm_output_format fields have defaults at class definition
BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. The batch path in blingfire also merges split-tag sentences. Removes the unused TagAwareBuffer; the tokenizer handles it natively. Fixes AgentConfigUpdate.instructions to use str instead of Instructions. 21 regression tests for batch + streaming with all TTS tag patterns.
Covers batch + streaming paths with: self-closing tags, wrapping tags, periods in attributes and content, abbreviations (U.S.A., N.A.S.A.), phoneme with IPA/arpabet, chunk boundary splits, char-by-char streaming, unicode (French, Chinese, emoji), mixed tags, and a realistic multi-sentence conversation.
…cised Blingfire doesn't split tiny fragments. Tests now use realistic multi-sentence content inside tags so splits actually trigger and the XML-aware merge is verified.
force-pushed from bea7e82 to c36f944
```diff
 if text_transforms:
-    input = _apply_text_transforms(input, text_transforms)
+    # text transforms only apply to plain text mode (no structured output)
+    input = _apply_text_transforms(input, text_transforms)  # type: ignore[arg-type]
```
🔴 Text transforms crash at runtime when llm_output_format sends BaseModel objects through the TTS pipeline
When llm_output_format is set on an Agent, _llm_inference_task (generation.py:219-225) sends BaseModel objects through text_ch. These flow into _tts_inference_task where _apply_text_transforms is applied unconditionally at line 325. The default text transforms (filter_markdown and filter_emoji) perform string operations like buffer += chunk (filters.py:103) and EMOJI_PATTERN.sub("", chunk) (filters.py:156) that will raise TypeError when chunk is a BaseModel instead of str.
Since DEFAULT_TTS_TEXT_TRANSFORMS = ["filter_markdown", "filter_emoji"] is always active by default, any Agent using llm_output_format will crash at runtime unless the user explicitly sets tts_text_transforms=None. The comment at line 324 acknowledges the incompatibility but no guard is implemented.
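One possible shape for such a guard, sketched here under the assumption that each transform can be applied chunk-by-chunk as str -> str (the real filters are stream-based, so the actual fix would sit at the stream level); names and structure are illustrative, not the PR's code:

```python
from collections.abc import AsyncIterable, AsyncIterator
from typing import Callable

from pydantic import BaseModel


async def guarded_text_transforms(
    stream: AsyncIterable[str | BaseModel],
    transforms: list[Callable[[str], str]],
) -> AsyncIterator[str | BaseModel]:
    async for chunk in stream:
        if isinstance(chunk, BaseModel):
            # structured output: str filters like filter_emoji would raise TypeError
            yield chunk
            continue
        for transform in transforms:
            chunk = transform(chunk)
        yield chunk
```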
```diff
 current_span.set_attribute(trace_types.ATTR_SPEECH_ID, speech_handle.id)
 if instructions is not None:
-    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, instructions)
+    current_span.set_attribute(trace_types.ATTR_INSTRUCTIONS, str(instructions))
```
this only adds the common part to the trace but not the version used for this turn?
Fixed: the trace now shows the modality-resolved text, not just the common part.
```diff
-chat_ctx.add_message(role="system", content=[instructions])
+# re-resolve instructions for the current turn's modality
+turn_modality = speech_handle.input_details.modality
+turn_instructions = instructions if instructions is not None else self._agent.instructions
```
is it expected that this replaces the original instructions with the turn instructions entirely?
…failures
- ChatContext only stores str, never Instructions objects
- Per-turn modality resolution only when Instructions has audio/text variants
- Plain str instructions pass through unchanged (no re-resolution)
- Revert unintended fake_llm changes
- Fix add_message to resolve Instructions to str
```python
Instructions("You are a helpful assistant.")

# ...

@property
def audio(self) -> str:
```
this is breaking? IMO we should keep this just as a wrapper; it's much easier to write instructions.text instead of instructions.as_modality('text')
Instructions was supposed to be in beta, I'm not sure if anybody is using it
```python
    rtc.EventEmitter[Literal["metrics_collected", "error"] | TEvent],
    Generic[TEvent],
):
    class Markup:
```
nit, not sure about the name. it wraps a TTS. why not expose these on TTS itself?
It is the case though?
tts.markup?
```diff
 llm: NotGivenOr[llm.LLM | llm.RealtimeModel | LLMModels | str | None] = NOT_GIVEN,
 tts: NotGivenOr[tts.TTS | TTSModels | str | None] = NOT_GIVEN,
 mcp_servers: NotGivenOr[list[mcp.MCPServer] | None] = NOT_GIVEN,
+expressiveness: NotGivenOr[bool] = NOT_GIVEN,
```
if a user wanted to override how they prompt the LLM for expressiveness, where would they do it?
should this be a bool | ExpressivenessOptions?
They do it inside the new AgentInstructions class
```python
    return self._interruption_detection

@property
def expressiveness(self) -> NotGivenOr[bool]:
```
if we want options, then it'd be better to always return options vs a bool
```python
    str(instructions) if not isinstance(instructions, str) else instructions
)

# ...

class _SafeFormatter(string.Formatter):
```
…with render(), improved provider prompts
- ExpressivenessOptions moved to agent_session.py as TypedDict with DEFAULT_EXPRESSIVENESS_OPTIONS
- Instructions: removed format/as_modality/__add__, added render(modality, data) returning str
- Instructions: added resolve_template() static method for workflow modality-aware composition
- safe_render utility in utils/misc.py with nested dict → SimpleNamespace, error logging with full dotted paths
- Template data uses explicit dicts with proper namespaces (tts.markup.llm_instructions, audio_recognition.stt_context.emotion)
- AudioRecognition.llm_instructions() method matching tts.markup.llm_instructions() API
- Cartesia prompt: complete 62-emotion list, examples, XML format explained
- ElevenLabs prompt: normalization rules, SSML tags, examples
- Removed _concat_optional, _safe_format, AgentInstructions
Devin Review found 1 new potential issue.
1 issue in files not directly in the diff
AgentConfigUpdate raises ValidationError when Agent.instructions is an Instructions object (livekit-agents/livekit/agents/voice/agent_activity.py:771)
At agent_activity.py:770-771, self._agent.instructions (typed as str | Instructions) is passed directly to llm.AgentConfigUpdate(instructions=...), whose field is typed str | None. The old Instructions class was a str subclass and had a custom __get_pydantic_core_schema__, so Pydantic accepted it. The refactored Instructions is a plain class with neither, so Pydantic v2 rejects it with ValidationError: Input should be a valid string. This crashes any agent created with Agent(instructions=Instructions(...)) when the activity starts.
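A later commit in this PR notes the fix ("Fixes AgentConfigUpdate.instructions to use str instead of Instructions"); the minimal coercion would look roughly like:

```python
# coerce Instructions (or any non-str instructions) to str before Pydantic validation
update = llm.AgentConfigUpdate(instructions=str(self._agent.instructions))
```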
View 12 additional findings in Devin Review.
…Labs v3)
<expression value="..."/> is the XML bridge for providers that use [] brackets natively. The LLM always generates XML; plugins convert to native format before sending to the API.
- Cartesia: native XML, no conversion needed
- ElevenLabs v2: native SSML, no conversion
- ElevenLabs v3: <expression> → [laughs], [whispers], etc.
- Inworld TTS 2: <expression> → [say excitedly], [laugh], etc.
Added TTS.Markup.convert() method, convert_expression_tags() and strip_bracket_tags() helpers, complete provider prompts with examples.
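The commit names convert_expression_tags() without showing its body; a minimal sketch of the bracket conversion it describes (the regex is an assumption, not the plugin's actual pattern):

```python
import re

# <expression value="laughs"/> -> [laughs]  (ElevenLabs v3 / Inworld bracket style)
_EXPRESSION = re.compile(r'<expression\s+value="([^"]*)"\s*/>')


def convert_expression_tags(text: str) -> str:
    return _EXPRESSION.sub(lambda m: f"[{m.group(1)}]", text)


print(convert_expression_tags('<expression value="whispers"/> come closer'))
# [whispers] come closer
```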
…ting self._markup
Base TTS.__init__ calls self.Markup(self) automatically. Plugins just define their Markup inner class; no manual self._markup assignment needed.
- Complete steering prompt with free-form delivery, non-verbals, breaks, emphasis
- Based on Inworld TTS 2 docs (steering, prompting best practices)
- Add inworld-tts-2 to inference gateway models
- Gateway detects TTS 2 vs older models (only TTS 2 supports steering)
- Remove IPA and asterisk emphasis from prompt (framework doesn't strip these)
…xamples
Before: cartesia ~464, elevenlabs ~301, elevenlabs_v3 ~363, inworld ~905 tokens
After: cartesia ~294, elevenlabs ~110, elevenlabs_v3 ~174, inworld ~349 tokens
…veness
11 delivery styles from casual to extreme, practical non-verbal examples, conversational and emotional range the LLM can reference.
Devin Review found 2 new potential issues.
⚠️ 1 issue in files not directly in the diff
⚠️ Per-chunk markup stripping in streaming transcript fails when XML tags span LLM tokens (livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617)
When expressiveness is enabled, _read_text at livekit-agents/livekit/agents/voice/agent_activity.py:2616-2617 calls self.tts.markup.to_text(chunk) on individual LLM output tokens. Since to_text uses regex to match complete XML tags (strip_xml_tags in livekit-agents/livekit/agents/tts/markup_utils.py:37), partial tags spanning multiple tokens (e.g., <emotion then value="happy"/>) won't be matched and will leak into the real-time user transcript. The final transcript stored in chat history at livekit-agents/livekit/agents/voice/agent_activity.py:2787-2788 IS correctly stripped because it operates on the full accumulated text, so only the streaming display is affected.
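A tiny repro of the failure mode, assuming a complete-tag regex of the usual shape (the actual pattern in markup_utils.py may differ):

```python
import re


def strip_tags(s: str) -> str:
    # only matches *complete* tags, like the regex-based strip_xml_tags
    return re.sub(r"<[^>]*>", "", s)


whole = '<emotion value="happy"/>Hi!'
print(strip_tags(whole))  # 'Hi!' -> full accumulated text is stripped correctly

tokens = ['<emotion ', 'value="happy"/>', 'Hi!']
print("".join(strip_tags(t) for t in tokens))
# '<emotion value="happy"/>Hi!' -> the split tag leaks into the streaming transcript
```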
View 17 additional findings in Devin Review.
```python
def __str__(self) -> str:
    return self.common

# ...

    Both ``_audio_variant`` and ``_text_variant`` are preserved so this can
    be called again for a different modality (e.g. across tool-call turns).
    """
    return Instructions(
        audio=self._audio_variant,
        text=self._text_variant,
        _represent=self.audio if modality == "audio" else self.text,
    )

# ...

def __repr__(self) -> str:
    return f"Instructions({self.common!r})"

def __hash__(self) -> int:
    return hash((self.common, self.audio, self.text))
```
🟡 Instructions.__eq__ with str violates the hash contract
Instructions.__eq__ returns True when compared to a plain str with the same common value, but __hash__ produces a different value (it hashes a 3-tuple of (common, audio, text)). This violates Python's data model invariant: if a == b, then hash(a) == hash(b) must hold. This causes incorrect behavior when Instructions and str objects are mixed in sets or used as dict keys.
Demonstration:

```python
instr = Instructions("hello")
assert instr == "hello"              # True
assert hash(instr) == hash("hello")  # False! Violates contract

d = {"hello": 1}
d[instr]  # May raise KeyError despite instr == "hello"
```

Suggested change:

```diff
 def __hash__(self) -> int:
-    return hash((self.common, self.audio, self.text))
+    return hash(self.common)
```
Prevents sending chunks exceeding provider limits (e.g. Inworld's 1000 chars). Splits at sentence boundaries, never mid-sentence. max_input_len dict in _provider_format.py, used by the inference gateway. Switched the gateway from the basic tokenizer to blingfire.
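A minimal sketch of sentence-boundary packing under a provider limit (illustrative only; the PR's actual logic lives in _provider_format.py and the gateway):

```python
def pack_sentences(sentences: list[str], max_len: int = 1000) -> list[str]:
    """Group whole sentences into chunks that stay under max_len."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_len:
            chunks.append(current)  # flush before crossing the provider limit
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)  # a single over-long sentence is kept whole
    return chunks
```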
convert() was called on individual LLM tokens: too early, so the regex never saw complete <expression> tags. It now converts after the sentence tokenizer accumulates full sentences, right before sending to the API. Removed the token-level convert from the default tts_node. Added debug logging in the gateway showing the converted text sent to the API. Cleaned up debug prints from the drive-thru example.
All examples now layer mood + energy + pacing + vocal style. Added singing example. Removed bland short labels the LLM was copying.
…ributes
[^>]* greedily consumed the / before >, so <expression value="..."/> was never detected as self-closing. The tokenizer thought every tag was unclosed and held the entire buffer, causing all text to merge into one chunk (hitting Inworld's 1000-char limit). Fix: [^>]*? (non-greedy) stops before the /.
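The difference in two lines, using a combined tag pattern of the same shape (the exact regex in token_stream.py may differ):

```python
import re

tag = '<expression value="excited"/>'

greedy = re.match(r"<(\w+)([^>]*)(/?)>", tag)
print(greedy.group(3))      # '' -> the / was eaten by [^>]*, tag looks unclosed

non_greedy = re.match(r"<(\w+)([^>]*?)(/?)>", tag)
print(non_greedy.group(3))  # '/' -> correctly detected as self-closing
```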
Verifies that <expression value="..."/> and similar self-closing tags with attributes do NOT block sentence splitting (batch + streaming).
Devin Review found 3 new potential issues.
1 issue in files not directly in the diff
Transcription stream drops all content when llm_output_format is set, breaking chat history and user-facing transcripts (livekit-agents/livekit/agents/voice/agent_activity.py:2613-2615)
When Agent.llm_output_format is set for structured LLM output, _llm_inference_task sends BaseModel instances (not plain strings) to text_ch (livekit-agents/livekit/agents/voice/generation.py:224). This channel is tee'd into tts_text_input and tr_input. The TTS path correctly handles BaseModel in the default tts_node (livekit-agents/livekit/agents/voice/agent.py:542-549), extracting the response field delta. However, the transcription path's _read_text wrapper unconditionally skips all BaseModel instances (isinstance(chunk, (FlushSentinel, BaseModel)): continue), yielding nothing to the transcription_node. This means text_out.text is empty, which cascades to forwarded_text being empty, so the assistant's message is never added to chat_ctx β breaking conversation history, user-facing transcription, and any downstream logic that depends on the assistant message existing in the chat context.
View 20 additional findings in Devin Review.
```python
# incomplete tag at end: < without matching >
last_open = text.rfind("<")
last_close = text.rfind(">")
if last_open > last_close:
```
🟡 _has_unclosed_xml_tags false positive on < in non-XML text prevents sentence splitting
The _has_unclosed_xml_tags function in token_stream.py returns True whenever the text contains a < that appears after the last >, even in regular prose like "the price is < 5 dollars. That's cheap.". The check at lines 23-25 (last_open = text.rfind("<"); last_close = text.rfind(">"); if last_open > last_close: return True) fires for any bare < character, causing the streaming tokenizer to hold the entire buffer and never split sentences. This could stall TTS output for any text containing mathematical comparisons, template syntax, or other non-XML uses of <.
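The false positive in two lines, plus one possible tighter check requiring a tag-like character right after the < (this guard is a suggestion, not the PR's code):

```python
import re

text = "the price is < 5 dollars. That's cheap."
print(text.rfind("<") > text.rfind(">"))  # True -> buffer held, sentences never split


def looks_like_open_tag(text: str) -> bool:
    i = text.rfind("<")
    # only treat '<' as XML when followed by a tag name, '/', or '!'
    return i > text.rfind(">") and bool(re.match(r"[A-Za-z/!]", text[i + 1 : i + 2]))


print(looks_like_open_tag(text))  # False -> prose with a bare '<' still splits
```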
```python
def __init__(
    self,
    common: str = "",
    *,
    audio: str | None = None,
    text: str | None = None,
) -> None:
    self.common = common
    self.audio = audio
    self.text = text
```
🔴 Workflow tasks still use the old Instructions(audio_text, text=text_text) positional-arg pattern
The Instructions.__init__ signature changed from (audio, *, text=None) to (common='', *, audio=None, text=None). Several workflow files that were NOT updated by this PR still construct Instructions with the old pattern where the first positional arg was the audio-specific variant:
Affected files
- livekit-agents/livekit/agents/beta/workflows/phone_number.py:82-96
- livekit-agents/livekit/agents/beta/workflows/dob.py:89-105
- livekit-agents/livekit/agents/beta/workflows/name.py:114-133
- livekit-agents/livekit/agents/beta/workflows/credit_card.py:165-177, :293-..., :388-...
With the old API, Instructions(audio_text, text=text_text) stored audio_text as the audio variant and text_text as the text variant (mutually exclusive). With the new API, audio_text becomes the common field (included in ALL modalities) and text_text becomes the text addition (appended to common). So render(modality="text") now returns audio_text + "\n\n" + text_text, concatenating audio-specific and text-specific instructions together, which is incorrect and will produce garbled LLM prompts in text mode.
Prompt for agents
The Instructions constructor was changed from Instructions(audio, *, text=None) to Instructions(common, *, audio=None, text=None), but several workflow files still use the old positional-arg pattern Instructions(audio_text, text=text_text). These files need to be updated to the new signature. The correct migration for each call site like Instructions(audio_text, text=text_text) would be Instructions(common='', audio=audio_text, text=text_text): make common empty and place the modality-specific text in the audio and text params. Alternatively, these workflow tasks could be refactored to use the WorkflowInstructions/resolve() pattern that address.py and email_address.py were updated to. Affected files: beta/workflows/phone_number.py, beta/workflows/dob.py, beta/workflows/name.py, beta/workflows/credit_card.py.
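Side by side, the migration for one affected call site (variable names taken from the finding above):

```python
# old signature: Instructions(audio, *, text=None); the first positional
# arg was the audio-specific variant, mutually exclusive with text
instr = Instructions(audio_text, text=text_text)

# new signature: Instructions(common="", *, audio=None, text=None); keep the
# variants mutually exclusive by leaving common empty
instr = Instructions(common="", audio=audio_text, text=text_text)
```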
Summary
- Instructions reworked from str subclass to plain class with common/audio/text
- RecognizeStream.context + SpeakerContext protocol; AudioRecognition exposes only stt_context
- llm_output_format with llm.Response annotation, streaming JSON partial parsing
- TTS.Markup inner class, shared _provider_format.py for Cartesia/ElevenLabs
- BufferedTokenStream holds back tokens with unclosed XML tags (53 regression tests)
- InstructionParts removed, replaced by WorkflowInstructions

Expressiveness mode
The framework injects system messages telling the LLM about available TTS tags:
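The shipped wording lives in the per-provider prompt templates (Cartesia, ElevenLabs, Inworld); as a rough illustration of the shape only (wording invented here, tag names taken from this PR):

```
You can shape speech delivery with XML tags:
  <emotion value="happy"/>      set the emotional tone of what follows
  <expression value="laughs"/>  insert a non-verbal expression
  <spell>U.S.A.</spell>         spell the wrapped text out loud
Use them sparingly and naturally.
```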
The LLM then uses markup naturally. Markup is stripped from transcripts and chat history:
LLM output:   <emotion value="sad"/> I understand how you feel.
Sent to TTS:  <emotion value="sad"/> I understand how you feel.
Transcript:   I understand how you feel.
Chat history: I understand how you feel.

Custom templates and per-plugin overrides:
ElevenLabs example with normalization:
Stateless Instructions
Reworked from str subclass to plain class. No Pydantic, no runtime state.

Hierarchy:
Instructions → AgentInstructions → WorkflowInstructions. InstructionParts removed, replaced by WorkflowInstructions(AgentInstructions).

STT speaker context + AudioRecognition
STT plugins set metadata on their stream. Accessible anywhere on the Agent:
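A hedged sketch of both sides of the flow; only RecognizeStream.context and the stt_context accessor are confirmed by this PR, while the emotion field comes from the template namespace audio_recognition.stt_context.emotion and the provider hook is hypothetical:

```python
# STT plugin side: attach speaker metadata to the recognize stream
class MyRecognizeStream(stt.RecognizeStream):
    def _on_provider_event(self, event) -> None:  # hypothetical provider hook
        self.context.emotion = event.emotion       # SpeakerContext metadata

# Agent side: read the same metadata wherever the agent runs
emotion = agent.audio_recognition.stt_context.emotion  # assumed access path
```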
AudioRecognition is now a public class but all fields and methods are private; only stt_context is exposed.

Structured LLM output
All fields must have defaults, validated at class definition via __init_subclass__. The LLM is configured for structured output; JSON is streamed and partially parsed via pydantic_core.from_json(allow_partial=True). tts_node receives BaseModel chunks (explicit opt-in; existing custom tts_node implementations that only handle str are unaffected unless llm_output_format is set).
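As a minimal sketch of the partial-parse loop named above: accumulate the streamed JSON and re-parse the growing prefix on every chunk (pydantic trims incomplete trailing values until they complete; see the allow_partial docs for the exact rules):

```python
from pydantic_core import from_json

buffer = ""
for chunk in ['{"response": "Hel', 'lo!", "mood": "ha', 'ppy"}']:
    buffer += chunk
    # returns the valid prefix parsed so far instead of raising on truncation
    print(from_json(buffer, allow_partial=True))
```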
Parsed output stored on ChatMessage.llm_output.

XML-aware sentence tokenizer
BufferedTokenStream now holds back tokens that contain unclosed XML tags, preventing sentence splits inside markup like <spell>U.S.A.</spell>. The Blingfire batch path also merges split-tag sentences. 53 regression tests covering self-closing tags, wrapping tags, decimals in attributes, nested tags, chunk boundary splits, unicode, and a realistic multi-sentence conversation.