docs(tts): add Qwen3-TTS documentation #28

Open
yuuhikaze wants to merge 7 commits into Open-LLM-VTuber:main from yuuhikaze:feat/qwen3-tts

Conversation


@yuuhikaze yuuhikaze commented Apr 25, 2026

What was documented

Added Qwen3-TTS as a new TTS engine to the backend TTS page (both English and Chinese versions).

Coverage

  • Installation instructions
  • Three operation modes: voice_clone, voice_design, and custom_voice
  • Model variant table (different parameter sizes and capabilities)
  • Hardware requirements and notes
  • Attention backend options
  • voice_clone usage tips
  • voice_design per-sentence consistency caveat with ComfyUI workflow tip

Pages updated

  • English TTS page
  • Chinese TTS page

Related

Related backend PR: Open-LLM-VTuber/Open-LLM-VTuber#378

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation

  • Added a comprehensive Qwen3‑TTS user guide covering local setup, optional dependency installation, and VRAM requirements.
  • Documented three generation modes: voice_clone, voice_design, and custom_voice, with per‑mode requirements and behavior.
  • Explained model selection rules and override options for choosing models.
  • Described voice_clone inputs (reference audio/text vs x-vector-only), voice_design’s per‑call generation and streaming inconsistency, attention backend choices, and recommended workflows for consistent voices.
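
For readers following the mode descriptions above, the per-mode requirements can be sketched as a small validator. This is a minimal illustration, not the backend's actual schema: the `mode` and `speaker` field names and the rule that `voice_design` requires `instruct` are assumptions, while `ref_audio`, `ref_text`, `x_vector_only_mode`, and `instruct` are the names mentioned in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Qwen3TTSSettings:
    """Illustrative subset of a qwen3_tts block (hypothetical layout)."""
    mode: str                          # "voice_clone", "voice_design", or "custom_voice"
    ref_audio: Optional[str] = None    # voice_clone: path to the reference audio
    ref_text: Optional[str] = None     # voice_clone: transcript of the reference audio
    x_vector_only_mode: bool = False   # voice_clone: skip the transcript, use the speaker embedding only
    instruct: Optional[str] = None     # voice_design (optionally custom_voice): style description
    speaker: Optional[str] = None      # custom_voice: hypothetical name for the predefined speaker field


def validate(settings: Qwen3TTSSettings) -> List[str]:
    """Return a list of human-readable problems; an empty list means the settings look usable."""
    problems: List[str] = []
    if settings.mode == "voice_clone":
        if not settings.ref_audio:
            problems.append("voice_clone needs ref_audio")
        if not settings.x_vector_only_mode and not settings.ref_text:
            problems.append("voice_clone needs ref_text unless x_vector_only_mode is true")
    elif settings.mode == "voice_design":
        if not settings.instruct:
            problems.append("voice_design needs instruct")
    elif settings.mode == "custom_voice":
        if not settings.speaker:
            problems.append("custom_voice needs a predefined speaker")
    else:
        problems.append(f"unknown mode: {settings.mode!r}")
    return problems
```

A check like this would typically run once at startup, so that a missing reference clip is reported before the first sentence is synthesized.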

yuuhikaze and others added 3 commits April 25, 2026 13:09
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ove model size bias

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da96dbc8-d6c3-45f9-a110-32d9d1f8de3a

📥 Commits

Reviewing files that changed from the base of the PR and between 0eab632 and acc0f9c.

📒 Files selected for processing (2)
  • docs/user-guide/backend/tts.md
  • i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md
✅ Files skipped from review due to trivial changes (1)
  • i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/user-guide/backend/tts.md

📝 Walkthrough

Adds comprehensive Qwen3-TTS documentation covering local setup, optional dependency installation, conf.yaml configuration examples, model selection rules, three generation modes (voice_clone, voice_design, custom_voice), attention backend options, VRAM requirements, and mode-specific operational notes including voice-consistency workflows.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Qwen3-TTS Documentation: docs/user-guide/backend/tts.md, i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md | New full user guide for Qwen3-TTS: installation via optional qwen3-tts extras, detailed conf.yaml fields (including tts_model: qwen3_tts and the qwen3_tts block), model selection mapping (model_type/model_size → HF repos) and model_path override, VRAM hints, three operational modes with required inputs (ref_audio, ref_text/ICL, x_vector_only_mode), voice_design per-call behavior and streaming consistency guidance, attention backend options and pip-extra mappings. |
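
The model selection rule summarized above (a `model_type`/`model_size` pair mapped to a Hugging Face repo, with `model_path` taking precedence) might look roughly like the sketch below. The type and size values and the repo IDs are placeholders invented for illustration; the real mapping lives in the documentation and backend under review.

```python
from typing import Dict, Optional, Tuple

# Placeholder keys and repo IDs: stand-ins for illustration, not the real Hugging Face names.
MODEL_REPOS: Dict[Tuple[str, str], str] = {
    ("base", "small"): "example-org/qwen3-tts-base-small",
    ("base", "large"): "example-org/qwen3-tts-base-large",
    ("voice_design", "large"): "example-org/qwen3-tts-voice-design-large",
}


def resolve_model(model_type: str, model_size: str, model_path: Optional[str] = None) -> str:
    """Pick the model to load: an explicit model_path wins, otherwise use the lookup table."""
    if model_path:
        return model_path
    try:
        return MODEL_REPOS[(model_type, model_size)]
    except KeyError:
        raise ValueError(
            f"no model for model_type={model_type!r}, model_size={model_size!r}"
        ) from None
```

Keeping the override as the first branch means a local checkpoint can be used without touching the table, which is the behavior the summary describes for `model_path`.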

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through docs at dawn,

Tweaked the bits where voices yawn.
Qwen3 sings in fresh new lines,
Configured hops and tiny signs—
A rabbit's cheer for TTS fine! 🎶

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding documentation for Qwen3-TTS as a new TTS backend option. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md (2)

109-110: Consider adjusting the tone for consistency.

The casual phrasing ("a bit of a square peg in a round hole", "don't expect seamless results out of the box") is notably more informal than the rest of the documentation. While the honesty about limitations is valuable, you might consider a more neutral tone that still conveys the caveat without potentially undermining user confidence in the feature.

✍️ Alternative phrasing
-Honestly, `voice_design` is a bit of a square peg in a round hole here. It wasn't really designed with a sentence-by-sentence streaming runtime in mind, so don't expect seamless results out of the box. It's there if you want to experiment, but the design→clone workflow below is how you'd actually use it for anything consistent.
+Note: `voice_design` was not originally designed for sentence-by-sentence streaming runtimes. For production use requiring consistent voice characteristics across a session, the design→clone workflow below is recommended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md`
around lines 109 - 110, The paragraph referencing "voice_design" uses informal
idioms ("a bit of a square peg in a round hole", "don't expect seamless results
out of the box"); revise it to a more neutral, consistent tone by replacing
those phrases with a concise caveat about limitations and intended use: state
that voice_design was not primarily built for sentence-by-sentence streaming,
note that results may vary, and recommend the design→clone workflow for
consistent production use while keeping experimental encouragement; update the
sentence containing "voice_design" to this neutral wording.

120-122: Consider the longevity of the personal repository link.

The tip references a specific personal NixOS configuration. While this provides a helpful real-world example, personal repository links can become stale if the repository is renamed, deleted, or reorganized. Consider whether this example could be generalized or whether the core setup steps could be documented inline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md`
around lines 120 - 122, The tip in tts.md currently links to a personal NixOS
config (the codeberg URL) which may go stale; replace that specific personal
repository link with either: (a) a short, generalized summary of the key ComfyUI
+ Qwen3-TTS NixOS setup steps to include inline in the tip, or (b) a link to an
authoritative/maintained resource (ComfyUI docs or a community NixOS module
example) and remove the personal URL. Locate the tip block referencing ComfyUI
and Qwen3-TTS and update the content to provide stable, reproducible guidance
rather than a single-person repo link.
docs/user-guide/backend/tts.md (1)

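109-110: Consider adjusting the tone to keep the documentation consistent.

The phrasing here ("方榫插圆孔", i.e. "a square peg in a round hole", and "开箱即用的效果不要抱太高期望", i.e. "don't expect much out of the box") is noticeably more casual than the rest of the documentation. While being candid about limitations is valuable, consider a more neutral tone that conveys the caveat without undermining user confidence in the feature.

✍️ Alternative phrasing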
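-坦白说,`voice_design` 放在这里有点像方榫插圆孔。它本来就不是为逐句流式合成的运行时设计的,所以开箱即用的效果不要抱太高期望。如果你想折腾,功能是有的,但要真正用于实际场景,还是下面的设计→克隆工作流更靠谱。 [EN: Frankly, `voice_design` is a bit of a square peg in a round hole here. It was never designed for a sentence-by-sentence streaming runtime, so don't expect much out of the box. The feature is there if you want to tinker, but for real-world use the design→clone workflow below is more dependable.]
+注意:`voice_design` 并非专为逐句流式合成运行时设计。对于需要在整个会话中保持声音一致性的生产环境使用,推荐采用下述设计→克隆工作流。 [EN: Note: `voice_design` was not designed specifically for sentence-by-sentence streaming runtimes. For production use that needs a consistent voice across a whole session, the design→clone workflow described below is recommended.]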
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

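In `@docs/user-guide/backend/tts.md` around lines 109 - 110, the description of
voice_design is too casual in tone (for example "方榫插圆孔" and
"开箱即用的效果不要抱太高期望"); rewrite it in a neutral, professional tone consistent
with the rest of the documentation: keep the factual statement that voice_design
was not designed for sentence-by-sentence streaming synthesis, note that it may
be limited in that scenario, point to the more reliable alternative (such as the
"clone workflow" mentioned later in the text), and give a brief recommendation
or link so users can proceed; to locate the change, find the paragraph about
voice_design and replace it with neutral wording.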
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user-guide/backend/tts.md`:
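- Line 107:
In the description of streaming mode, change the non-standard phrase "单独的调用" to the adverbial form "单独地调用", so that the sentence reads "在流式模式下,每句话是一次单独地调用,因此各句之间的声音会不一致。" ("In streaming mode, each sentence is a separate call, so the voice will be inconsistent between sentences."); use the identifiers
`voice_design`, `generate_audio`, and `instruct` to locate the paragraph, and replace the phrase so that it follows the "adverb + 地 + verb" grammar rule.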
- Line 12: The sentence "无需单独安装运行时依赖" ("no separate runtime
dependencies need to be installed") is misleading. Update the text in
docs/user-guide/backend/tts.md to clarify that the inference code for Qwen3-TTS
is included in the source tree but Python runtime dependencies must be installed
via the optional extras (e.g., `pip install ".[qwen3-tts]"`); replace the phrase
"无需单独安装运行时依赖" with wording like "推理代码已内置于源代码树中,仅需通过可选依赖组安装 Python 依赖即可(例如运行 `pip
install ".[qwen3-tts]"`)" ("the inference code is built into the source tree;
you only need to install the Python dependencies via the optional dependency
group, e.g. by running `pip install ".[qwen3-tts]"`") so readers understand
that the code is included but the dependencies still need the optional install.

In `@i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md`:
- Line 11: Update the TTS introduction to clarify that while the Qwen3-TTS
inference code is vendored into the repo, users still must install Python
dependencies via the optional dependency group; replace or reword the sentence
containing "no separate runtime installation is required" (referencing the
Qwen3-TTS paragraph and the Installation section that uses uv pip install
".[qwen3-tts]") to say something like: "The inference code is vendored into the
source tree — only Python dependencies need to be installed via the optional
dependency group (e.g., uv pip install '.[qwen3-tts]')."
- Around line 74-78: Update the comment for the configuration key device: 'cuda'
to state that PyTorch interprets 'cuda' as the first GPU (cuda:0) rather than
performing automatic load-balancing across multiple GPUs; explicitly note that
on multi-GPU machines users must configure DataParallel or
DistributedDataParallel (or other multi-GPU strategies) to utilize more than one
device, and keep the existing examples 'cuda:0'/'cuda:1'/... as the way to
target a specific GPU.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3f3ab1b-0d4c-4b7f-b557-57178fc47174

📥 Commits

Reviewing files that changed from the base of the PR and between 622a307 and 0eab632.

📒 Files selected for processing (2)
  • docs/user-guide/backend/tts.md
  • i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/tts.md

yuuhikaze and others added 3 commits April 27, 2026 08:47
- Clarify vendored claim: inference code included, Python deps still needed
- Fix ZH grammar: 单独的调用 → 单独地调用
- Neutralise voice_design tone in EN
- Correct device: 'cuda' description (defaults to cuda:0, not auto load-balance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Note that instruct can optionally style-modify predefined speakers
in custom_voice mode, not just voice_design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@yuuhikaze
Author

Round-2 review follow-up (commit 54cd09d).

Fixed in this round:

  • EN voice_design section (Issue 3): The previous fix left a duplicated paragraph — both the first and second paragraphs stated the same per-sentence inconsistency limitation. Merged them into a single, non-redundant paragraph that states the mechanism, the streaming limitation, and points to the design→clone workflow.
  • ZH voice_design section (Issue 3, ZH equivalent): The ZH file still had the informal paragraph ("坦白说,放在这里有点像方榫插圆孔...", roughly "Frankly, it's a bit of a square peg in a round hole here..."). It was rewritten to match the neutral tone of the corrected EN version, and the resulting duplication (same limitation stated twice) was collapsed into one paragraph.
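
To make the per-sentence limitation and the design→clone workflow concrete, here is a toy model of the two approaches. It does not call Qwen3-TTS: `fake_voice_design` and `fake_voice_clone` are hypothetical stand-ins that return a voice label instead of audio, and the assumption that every design call samples a new voice is taken from the discussion in this thread.

```python
import itertools
from typing import List

_voice_ids = itertools.count(1)


def fake_voice_design(instruct: str, text: str) -> str:
    """Stand-in for a voice_design call: every call yields a freshly sampled voice."""
    return f"designed-voice-{next(_voice_ids)}"


def fake_voice_clone(reference_voice: str, text: str) -> str:
    """Stand-in for a voice_clone call: the output follows the reference voice."""
    return reference_voice


def stream_with_design(instruct: str, sentences: List[str]) -> List[str]:
    """Streaming with voice_design: one independent call (and voice) per sentence."""
    return [fake_voice_design(instruct, s) for s in sentences]


def stream_with_design_then_clone(instruct: str, sentences: List[str]) -> List[str]:
    """Design a reference once, then clone it for every sentence."""
    reference = fake_voice_design(instruct, "A short reference line.")
    return [fake_voice_clone(reference, s) for s in sentences]
```

In this toy, three sentences streamed through `stream_with_design` come back with three different voice labels, while `stream_with_design_then_clone` returns the same label for all of them, which is the consistency the recommended workflow is after.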

Already correctly fixed — skipped:

  • Issue 1 (EN): "The inference code is vendored into the source tree — only Python dependencies need to be installed via the optional dependency group." Already present verbatim.
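  • Issue 1 (ZH): "推理代码已内置于源代码树中,仅需通过可选依赖组安装 Python 依赖。" ("The inference code is built into the source tree; only the Python dependencies need to be installed via the optional dependency group.") Already present and correct.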
  • Issue 2 (ZH): "单独地调用" already in place on line 107. No change needed.
  • Issue 4 (EN and ZH): Both files already had the corrected device: 'cuda' inline comment explaining equivalence to cuda:0, no load-balancing, and explicit per-index targeting via cuda:0/cuda:1.
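
Since the vendored-code wording (Issue 1) hinges on Python dependencies still needing to be installed, a quick check like the following can confirm whether an environment has them. This is a generic sketch using only the standard library; which modules the qwen3-tts extras actually pull in is not stated in this thread, so the names in the usage note are assumptions.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in the current environment.

    Only pass top-level names (no dots): find_spec imports parent packages
    for dotted names and can raise if a parent is missing.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["torch", "transformers"])` (module names assumed for illustration) returns an empty list when both are importable, and otherwise lists what the optional install still needs to provide.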

On the device: 'cuda' / DataParallel suggestion:

The review suggested adding guidance on DataParallel or DistributedDataParallel for multi-GPU setups. This is intentionally out of scope: Qwen3-TTS is used here for single-model inference, not training. DataParallel and DistributedDataParallel are primarily training-time constructs and do not fit this use case. Using multiple GPUs here means pinning the model to a specific card via cuda:N, which is already documented. Adding DDP guidance would be misleading.
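
The device semantics discussed here can be illustrated without importing PyTorch. The helper below is a plain-Python sketch of how the thread describes the setting: a bare 'cuda' resolves to the first GPU unless the process has selected another current device, and 'cuda:N' pins a specific card. The function name and the `default_index` parameter are hypothetical, and the sketch deliberately accepts only "cpu", "cuda", and "cuda:N".

```python
def normalize_device(device: str, default_index: int = 0) -> str:
    """Expand a device string into its explicit form, e.g. 'cuda' -> 'cuda:0'."""
    kind, sep, index = device.partition(":")
    if kind == "cpu" and not sep:
        return "cpu"
    if kind != "cuda":
        raise ValueError(f"unsupported device string: {device!r}")
    if not sep:
        # Bare 'cuda': no load-balancing, just the default (first) GPU.
        return f"cuda:{default_index}"
    if not index.isdigit():
        raise ValueError(f"invalid CUDA index in {device!r}")
    return f"cuda:{int(index)}"
```

On a two-GPU machine, running one process with `device: 'cuda:0'` and another with `device: 'cuda:1'` is the kind of explicit pinning the documentation describes; nothing in this sketch spreads a single model across cards.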

@yuuhikaze
Author

Addressed all round-2 feedback:

  • Restructured YAML comment headers in both EN and ZH docs to match the backend config: voice_clone, custom_voice, custom_voice / voice_design, and Generation parameters sections
  • Moved the instruct field under the combined custom_voice / voice_design section to reflect that it applies to both modes
  • Removed extra blank lines between sections for consistency
