feat: Support overriding language ID of the text emitted #1837

Open
cqjjjzr wants to merge 1 commit into rime:master from cqjjjzr:commit-langid

Conversation

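cqjjjzr commented Apr 9, 2026

Reference:

I found that the existing field in weasel::Config is not actually used (style can replace it), so I removed it and put a LANGID in its place. This relies on the fact that TSF lets you specify the language of a piece of text through GUID_PROP_LANGID when writing to an ITfRange. The override corrects the case where, because of the current keyboard setting, text in other languages typed with RIME is tagged as Chinese, which leads to wrong fonts and spell-check errors.

This adds a new configuration option, which may need to be documented.

The pre-edit text flickering problem is also fixed, as shown in the screenshot below (note that the Japanese text automatically switches to Yu Mincho, while the Chinese text uses the default DengXian).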

image

@fxliang fxliang requested a review from Copilot April 10, 2026 14:30

Copilot AI left a comment

Pull request overview

This PR adds a mechanism to override the TSF language ID applied to committed / composing text, so apps can use the correct font/spellcheck language even when the active keyboard layout would otherwise force a different LANGID.

Changes:

  • Introduces a commit_langid config value transported over IPC and stored per session.
  • Applies GUID_PROP_LANGID on TSF ranges (composition start, inline preedit updates, and committed text insertion).
  • Removes reliance on the previously-unused Config::inline_preedit field and uses UI style instead.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| WeaselTSF/WeaselTSF.h | Adds _SetRangeLanguage API and _textLangId storage for TSF language override. |
| WeaselTSF/EditSession.cpp | Reads commit_langid from IPC response and switches inline-preedit decision to style. |
| WeaselTSF/DisplayAttribute.cpp | Implements setting GUID_PROP_LANGID on a TSF range. |
| WeaselTSF/Composition.cpp | Applies the language override to composition/preedit/commit ranges. |
| WeaselIPC/Configurator.cpp | Parses config.commit_langid from IPC messages. |
| RimeWithWeasel/RimeWithWeasel.cpp | Loads locale-based override from configs and emits config.commit_langid over IPC. |
| include/WeaselIPCData.h | Updates IPC Config struct to carry commit_langid. |
| include/RimeWithWeasel.h | Extends session status to store commit_langid and adds loader method declaration. |

Comment thread WeaselTSF/EditSession.cpp
Comment on lines 18 to 21
if (ok) {
  bool inline_preedit = _cand->style().inline_preedit;
  _textLangId = static_cast<LANGID>(config.commit_langid);
  if (!commit.empty()) {

Copilot AI Apr 10, 2026

_textLangId is stored as a mutable WeaselTSF member, but the actual language assignment happens later in separately requested (potentially async) edit sessions (_StartComposition/_InsertText/_ShowInlinePreedit). If multiple edit sessions are queued, _textLangId can be overwritten before those sessions run, causing the wrong LANGID to be applied to the range. Consider capturing the langid per edit-session instance (store it in the edit session object) or applying the GUID_PROP_LANGID value within the same edit session that sets the text.
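
The overwrite hazard described in this comment can be illustrated without TSF. The following is a minimal, portable simulation and not WeaselTSF code: `EditSessionQueue`, `ApplyWithSharedLangId`, and `ApplyWithCapturedLangId` are hypothetical names, and the queue only stands in for edit sessions that may be granted later. It contrasts reading a shared member when the session runs with capturing the LANGID when the session is requested.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for edit sessions that are requested now and run later.
struct EditSessionQueue {
  std::queue<std::function<void()>> pending;
  void Request(std::function<void()> session) {
    pending.push(std::move(session));
  }
  void RunAll() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }
};

// Each session reads a shared variable (playing the role of _textLangId) at
// run time, so a later request can overwrite the value an earlier one needed.
std::vector<std::uint16_t> ApplyWithSharedLangId(
    const std::vector<std::uint16_t>& requested) {
  EditSessionQueue queue;
  std::vector<std::uint16_t> applied;
  std::uint16_t shared = 0;
  for (std::uint16_t langid : requested) {
    shared = langid;
    queue.Request([&shared, &applied] { applied.push_back(shared); });
  }
  queue.RunAll();
  return applied;
}

// Each session captures its own LANGID by value when it is requested.
std::vector<std::uint16_t> ApplyWithCapturedLangId(
    const std::vector<std::uint16_t>& requested) {
  EditSessionQueue queue;
  std::vector<std::uint16_t> applied;
  for (std::uint16_t langid : requested) {
    queue.Request([langid, &applied] { applied.push_back(langid); });
  }
  queue.RunAll();
  return applied;
}
```

With requests for 0x0411 followed by 0x0804, the shared variant applies 0x0804 to both sessions, while the capturing variant applies each request's own value.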

Comment thread WeaselTSF/EditSession.cpp

if (ok) {
bool inline_preedit = _cand->style().inline_preedit;
_textLangId = static_cast<LANGID>(config.commit_langid);

Copilot AI Apr 10, 2026

Casting config.commit_langid (int) to LANGID will silently truncate out-of-range values. Please validate/clamp to the valid LANGID range (0..0xFFFF) and treat invalid values as “no override” (or log) to avoid setting an unintended language ID.

Suggested change
- _textLangId = static_cast<LANGID>(config.commit_langid);
+ if (config.commit_langid >= 0 && config.commit_langid <= 0xFFFF) {
+   _textLangId = static_cast<LANGID>(config.commit_langid);
+ }

Comment on lines +64 to +69
if (SUCCEEDED(pContext->GetProperty(GUID_PROP_LANGID, &pLangIdProperty))) {
  VARIANT var;
  var.vt = VT_I4;
  var.lVal = _textLangId;
  hr = pLangIdProperty->SetValue(ec, pRange, &var);
  pLangIdProperty->Release();

Copilot AI Apr 10, 2026

VARIANT var is used without VariantInit(). Even though only VT_I4 is assigned, initializing the VARIANT avoids uninitialized padding/fields and aligns with COM best practices (same pattern also exists in _SetCompositionDisplayAttributes).

Comment on lines 15 to 21
 void Configurator::Store(Deserializer::KeyType const& key,
                          std::wstring const& value) {
-  if (!m_pTarget->p_context || key.size() < 2)
+  if (!m_pTarget->p_config || key.size() < 2)
     return;
   bool bool_value = (!value.empty() && value != L"0");
-  if (key[1] == L"inline_preedit") {
-    m_pTarget->p_config->inline_preedit = bool_value;
+  if (key[1] == L"commit_langid") {
+    m_pTarget->p_config->commit_langid = _wtoi(value.c_str());
   }

Copilot AI Apr 10, 2026

Parsing commit_langid with _wtoi provides no error reporting and accepts negative / out-of-range values, which then flow into TSF language property setting. Prefer a checked parse (e.g., wcstol with endptr) and clamp/reject values outside the valid LANGID range to keep IPC robust against malformed inputs.
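
A checked parse along the lines suggested here might look like the following sketch. `ParseLangId` is a hypothetical helper, not part of the PR. It assumes the value arrives as a decimal string, and it treats 0 as "no override", which is an assumption of this sketch rather than documented behavior.

```cpp
#include <cerrno>
#include <cstdint>
#include <cwchar>
#include <optional>
#include <string>

// Hypothetical helper: parse a decimal LANGID received over IPC.
// Returns std::nullopt for empty, malformed, zero, or out-of-range input so
// the caller can fall back to "no override".
std::optional<std::uint16_t> ParseLangId(const std::wstring& value) {
  if (value.empty())
    return std::nullopt;
  errno = 0;
  wchar_t* end = nullptr;
  const long parsed = std::wcstol(value.c_str(), &end, 10);
  // Reject input with no digits, trailing garbage ("1041abc"), or overflow.
  if (end == value.c_str() || *end != L'\0' || errno == ERANGE)
    return std::nullopt;
  // A LANGID is a 16-bit value; 0 is treated here as "not set".
  if (parsed <= 0 || parsed > 0xFFFF)
    return std::nullopt;
  return static_cast<std::uint16_t>(parsed);
}
```

Unlike `_wtoi`, this reports failure explicitly, so a caller could write `if (auto id = ParseLangId(value)) { ... }` and leave the stored value untouched otherwise.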

if (!lcid)
return false;

*commit_langid = LANGIDFROMLCID(lcid);

Copilot AI Apr 10, 2026

TryGetLangIdFromConfig writes the resulting LANGID into an int without any bounds/type enforcement. Since this value is later serialized over IPC and cast to LANGID, it’s safer to explicitly constrain it to 0..0xFFFF (and treat 0 as “not set”) before storing/sending.

Suggested change
- *commit_langid = LANGIDFROMLCID(lcid);
+ const unsigned long langid = static_cast<unsigned long>(LANGIDFROMLCID(lcid));
+ if (langid == 0 || langid > 0xFFFFul)
+   return false;
+ *commit_langid = static_cast<int>(langid);

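lotem (Member) commented May 26, 2026

I'll join in, though the final decision is still up to Mr. Liang~

Is commit_langid in the config a magic number? Users may not know how to set it.
Could we build in a mapping table that recognizes common language codes, such as zh for Chinese, ja for Japanese, and ko for Korean?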

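cqjjjzr (Author) commented May 26, 2026

> I'll join in, though the final decision is still up to Mr. Liang~
>
> Is commit_langid in the config a magic number? Users may not know how to set it. Could we build in a mapping table that recognizes common language codes, such as zh for Chinese, ja for Japanese, and ko for Korean?

The actual mapping is here: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/63d3d639-7fd2-4afb-abbe-0d5b5551eef8

That is feasible, but language codes and Microsoft's LANGIDs do not correspond one-to-one...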
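
As a rough illustration of what such a built-in table might look like, here is a minimal sketch. `LangIdFromTag` and its table are hypothetical and not taken from the PR; the defaults chosen for bare tags such as "zh" and "en" are assumptions, and they show the ambiguity mentioned above, since a bare language code has to be pinned to one regional LANGID.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical lookup: map a few common language tags to Windows LANGIDs.
// Matching is exact and case-sensitive to keep the sketch short.
std::optional<std::uint16_t> LangIdFromTag(const std::string& tag) {
  static const std::unordered_map<std::string, std::uint16_t> kTable = {
      {"zh", 0x0804},     // assumption: bare "zh" defaults to zh-CN
      {"zh-CN", 0x0804},  // Chinese (Simplified, PRC)
      {"zh-TW", 0x0404},  // Chinese (Traditional, Taiwan)
      {"ja", 0x0411},
      {"ja-JP", 0x0411},
      {"ko", 0x0412},
      {"ko-KR", 0x0412},
      {"en", 0x0409},     // assumption: bare "en" defaults to en-US
      {"en-US", 0x0409},
  };
  const auto it = kTable.find(tag);
  if (it == kTable.end())
    return std::nullopt;  // unknown tag: the caller applies no override
  return it->second;
}
```

A caller could fall back to the keyboard's default language whenever the lookup returns `std::nullopt`.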

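lotem (Member) commented May 27, 2026

Oh... all right. Then let users look it up themselves.

I'm wondering whether this parameter should be supplied when the input method is registered, so that it doesn't have to be set on every input segment. The installer could also offer several language options.
https://github.com/rime/weasel/blob/master/WeaselSetup/imesetup.cpp#L403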

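cqjjjzr (Author) commented May 27, 2026

> Oh... all right. Then let users look it up themselves.
>
> I'm wondering whether this parameter should be supplied when the input method is registered, so that it doesn't have to be set on every input segment. The installer could also offer several language options. https://github.com/rime/weasel/blob/master/WeaselSetup/imesetup.cpp#L403

That would take effect globally, but if the global setting were, say, Japanese, Chinese input would break again. So it needs to be set at least per schema, because which schema the input method has selected is not reflected in any global state (I have not found an API for "temporarily switching an input method's language").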

image

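fxliang (Contributor) commented May 27, 2026

> Oh... all right. Then let users look it up themselves.
>
> I'm wondering whether this parameter should be supplied when the input method is registered, so that it doesn't have to be set on every input segment. The installer could also offer several language options. https://github.com/rime/weasel/blob/master/WeaselSetup/imesetup.cpp#L403

Actually, I have done something similar; I just haven't opened a PR for it yet.

2b0ce5c
80c4424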

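fxliang (Contributor) commented May 27, 2026

Thinking about it further, this is still incomplete. What if a schema produces mixed output, with Chinese, English, and Japanese together? How should that be handled?

Let's study how Microsoft Pinyin implements its Simplified/Traditional switching. My guess is that there may be an API for switching the state.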

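cqjjjzr (Author) commented May 27, 2026

That would require librime itself to tag the language of each segment of the text produced by a single edit, which is probably more than we can finish. Whether the need for "a schema that supports several languages at once" really exists also remains to be investigated (at least this PR provides the infrastructure for temporarily changing the language of a single edit). I'll look at the MS Pinyin implementation later, but so far I really haven't found a concrete way to "temporarily switch the input method's LANGID without modifying the ITfRange language every time".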

fxliang (Contributor) commented May 27, 2026

ITfInputProcessorProfiles::ChangeCurrentLanguage

https://learn.microsoft.com/zh-cn/windows/win32/api/msctf/nf-msctf-itfinputprocessorprofiles-changecurrentlanguage
See whether this is usable.

fxliang (Contributor) commented May 27, 2026

Actually it could be even simpler: change only the TSF side and add a menu for choosing the language, effective only for the current application.

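cqjjjzr (Author) commented May 27, 2026

> ITfInputProcessorProfiles::ChangeCurrentLanguage
>
> https://learn.microsoft.com/zh-cn/windows/win32/api/msctf/nf-msctf-itfinputprocessorprofiles-changecurrentlanguage See whether this is usable.

This switches the whole input method, much like pressing Alt+Shift (i.e. Ctrl+Space on Windows 7).

> Actually it could be even simpler: change only the TSF side and add a menu for choosing the language, effective only for the current application.

I don't quite follow.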

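fxliang (Contributor) commented May 27, 2026

>> ITfInputProcessorProfiles::ChangeCurrentLanguage
>> https://learn.microsoft.com/zh-cn/windows/win32/api/msctf/nf-msctf-itfinputprocessorprofiles-changecurrentlanguage See whether this is usable.
>
> This switches the whole input method, much like pressing Alt+Shift (i.e. Ctrl+Space on Windows 7).
>
>> Actually it could be even simpler: change only the TSF side and add a menu for choosing the language, effective only for the current application.
>
> I don't quite follow.

Right now you expect the language to come from the schema or weasel.yaml. It could be simpler: add a submenu to the right-click menu of the language bar button, listing the languages you want to use. Once one is selected, text in the current application is tagged with the new language. That way neither the service nor the IPC needs to change; only the TSF side does.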

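cqjjjzr (Author) commented May 27, 2026

> Right now you expect the language to come from the schema or weasel.yaml. It could be simpler: add a submenu to the right-click menu of the language bar button, listing the languages you want to use. Once one is selected, text in the current application is tagged with the new language. That way neither the service nor the IPC needs to change; only the TSF side does.

A schema will basically have only one language. Suppose I only have a Chinese schema and a Japanese schema: when writing in Word I would have to switch the language along with almost every schema switch, which is not very convenient. I still think we should find a way to link the two.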

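lotem (Member) commented May 29, 2026

As I understand it, this PR also takes effect globally, right? The language is tagged at runtime, but the configuration is still global. In that case it would be better to declare the language when registering the input method.

Someone previously asked to register the input method under other languages/regions.
That should cover most of the related needs.
Few people really need mixed multilingual input, and it cannot be done perfectly anyway.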

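cqjjjzr (Author) commented May 29, 2026

> As I understand it, this PR also takes effect globally, right? The language is tagged at runtime, but the configuration is still global. In that case it would be better to declare the language when registering the input method.

That's not the case. I specify it in <schema>.schema.yaml. Also, after rereading the code: this PR converts a BCP 47 tag such as ja-JP into a LANGID, so there is no need to write 2052 in the config.

Because the setting follows the schema, switching schemas changes the input language.

> Someone previously asked to register the input method under other languages/regions.

That is a separate issue, which can currently be solved with a registry trick.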

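fxliang (Contributor) commented May 29, 2026

My view is that if the problem cannot be solved at the root, it is better to start with a less tightly coupled approach. If we ship a design that may not be mature, later changes will be breaking and troublesome.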

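cqjjjzr (Author) commented May 29, 2026

> My view is that if the problem cannot be solved at the root, it is better to start with a less tightly coupled approach. If we ship a design that may not be mature, later changes will be breaking and troublesome.

The core problem here is that "the LANGID needs to switch together with the schema". So unless an entirely new configuration source is introduced, there must be some way to deliver the schema's configuration to the TSF frontend. The setting could live in the schema config or in the weasel config, but the IPC probably has to change either way (unless the TSF side reads the file directly).

That said, since this introduces a new configuration option, it does need more discussion. Perhaps we should look at the other correctness bugs first.

> Few people really need mixed multilingual input

This issue has been raised many times in Discussions. People who ask here in Chinese will naturally also type Chinese with RIME. If they run into problems typing some other language with RIME, want to register it under another language, and resort to the registry and other tricks, then this kind of "multilingual input" (typing different languages with different schemas) should be a very natural need.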

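fxliang (Contributor) commented May 29, 2026

Here is a rough idea: switch the langid automatically based on the candidate text. The output of a test demo is below. Performance should be adequate, and there should be room to extend it to other languages.

Build command

cl langid.cpp LanguageDetectorW.cpp /utf-8 /EHsc && langid.exe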

+--------------+--------------------------------+----------+----------+--------+-------------+
|类型          |文本预览                        |结果      |预期      |状态    |耗时(ns)     |
+--------------+--------------------------------+----------+----------+--------+-------------+
|简体中文      |你好,这是一个测试。            |    0x0804|    0x0804|  PASS  |         4600|
|繁体中文      |你好,這是一個測試。            |    0x0404|    0x0404|  PASS  |         1500|
|简体短文      |我们一起去图书馆看书。          |    0x0804|    0x0804|  PASS  |         1600|
|繁体短文      |我們一起去圖書館看書。          |    0x0404|    0x0404|  PASS  |         1500|
|日文          |こんにちは世界、これはテス...   |    0x0411|    0x0411|  PASS  |         1100|
|韩文          |안녕하세요 세계, 이것은 테...   |    0x0412|    0x0412|  PASS  |         1400|
|英文          |Hello World, this is a test.    |    0x0409|    0x0409|  PASS  |         1100|
|短中文        |你好                            |    0x0804|    0x0804|  PASS  |          400|
|中英混合      |Hello 世界                      |    0x0804|    0x0804|  PASS  |          700|
|繁体短句      |這本書很有趣                    |    0x0404|    0x0404|  PASS  |          900|
|简体短句      |这本书很有趣                    |    0x0804|    0x0804|  PASS  |          500|
+--------------+--------------------------------+----------+----------+--------+-------------+

+-------------------------------------------------------------+
| 边界测试                                                    |
+-------------------------------------------------------------+
|文本: "" - 空字符串 -> English                               |
|文本: "1234567890" - 纯数字 -> English                       |
|文本: "!@#$%^&*()" - 纯符号 -> English                       |
|文本: "abc123汉字" - 混合字符串 -> 简体中文                  |
|文本: "𠀀𠀁𠀂" - CJK扩展B区字符 -> 简体中文                  |
+-------------------------------------------------------------+

+-------------------------------------------------------------+
| 性能测试 (10000次调用)                                      |
+-------------------------------------------------------------+
| 文本长度:     1021 字符                                     |
| 总耗时:        185 ms                                      |
| 平均:        18598 ns/次                           |
+-------------------------------------------------------------+

+-------------------------------------------------------------+
| 准确率测试                                                  |
+-------------------------------------------------------------+
| 所有测试均通过!                                            |
+-------------------------------------------------------------+
| 准确率:    10/10 (100.0%)                                 |
+-------------------------------------------------------------+

+------------------------------+----------+-----------------------------------+
| 语言                         | LANGID   | 宏定义                            |
+------------------------------+----------+-----------------------------------+
| 简体中文                     | 0x0804   | LANG_CHINESE_SIMPLIFIED           |
| 繁体中文                     | 0x0404   | LANG_CHINESE_TRADITIONAL          |
| 日文                         | 0x0411   | LANG_JAPANESE                     |
| 韩文                         | 0x0412   | LANG_KOREAN                       |
| 英文                         | 0x0409   | LANG_ENGLISH_US                   |
+------------------------------+----------+-----------------------------------+
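
For readers unfamiliar with the layout of the LANGID values listed above: a LANGID packs a primary language ID into its low 10 bits and a sublanguage ID into its high 6 bits. The sketch below mirrors the arithmetic of the Windows PRIMARYLANGID, SUBLANGID, and MAKELANGID macros in portable C++; the function names are local to this example and are not part of the demo. In the Windows headers, LANG_CHINESE is 0x04, LANG_JAPANESE is 0x11, and LANG_KOREAN is 0x12, so names like LANG_CHINESE_SIMPLIFIED in the table are the demo's own labels for full LANGIDs.

```cpp
#include <cstdint>

// Low 10 bits: primary language (e.g. 0x04 Chinese, 0x11 Japanese).
constexpr std::uint16_t PrimaryLang(std::uint16_t langid) {
  return static_cast<std::uint16_t>(langid & 0x3FF);
}

// High 6 bits: sublanguage (e.g. 0x02 Simplified, 0x01 Traditional).
constexpr std::uint16_t SubLang(std::uint16_t langid) {
  return static_cast<std::uint16_t>(langid >> 10);
}

// Inverse of the two functions above.
constexpr std::uint16_t MakeLangId(std::uint16_t primary, std::uint16_t sub) {
  return static_cast<std::uint16_t>((sub << 10) | primary);
}

// Compile-time checks against the values printed in the table.
static_assert(PrimaryLang(0x0804) == 0x04 && SubLang(0x0804) == 0x02, "zh-CN");
static_assert(PrimaryLang(0x0404) == 0x04 && SubLang(0x0404) == 0x01, "zh-TW");
static_assert(PrimaryLang(0x0411) == 0x11 && SubLang(0x0411) == 0x01, "ja-JP");
static_assert(PrimaryLang(0x0412) == 0x12 && SubLang(0x0412) == 0x01, "ko-KR");
static_assert(MakeLangId(0x09, 0x01) == 0x0409, "en-US");
```

This is why Simplified and Traditional Chinese share the same primary language and differ only in the sublanguage bits.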

LanguageDetectorW.h

#ifndef LANGUAGE_DETECTOR_W_H
#define LANGUAGE_DETECTOR_W_H

#include <array>
#include <cstdint>
#include <string>
#include <vector>

#ifndef LANG_CHINESE_SIMPLIFIED
#define LANG_CHINESE_SIMPLIFIED 0x0804
#endif

#ifndef LANG_CHINESE_TRADITIONAL
#define LANG_CHINESE_TRADITIONAL 0x0404
#endif

#ifndef LANG_ENGLISH_US
#define LANG_ENGLISH_US 0x0409
#endif

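// Note: winnt.h already defines LANG_JAPANESE (0x11) and LANG_KOREAN (0x12) as
// primary language IDs, so these two fallbacks are skipped whenever
// <windows.h> is included first. The detector itself only uses the
// DETECT_LANG_* constants below.
#ifndef LANG_JAPANESE
#define LANG_JAPANESE 0x0411
#endif

#ifndef LANG_KOREAN
#define LANG_KOREAN 0x0412
#endif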

constexpr uint16_t DETECT_LANG_CHINESE_SIMPLIFIED = 0x0804;
constexpr uint16_t DETECT_LANG_CHINESE_TRADITIONAL = 0x0404;
constexpr uint16_t DETECT_LANG_ENGLISH_US = 0x0409;
constexpr uint16_t DETECT_LANG_JAPANESE = 0x0411;
constexpr uint16_t DETECT_LANG_KOREAN = 0x0412;

class LanguageDetectorW {
public:
  static uint16_t DetectLangId(const std::wstring &text);
  static const wchar_t *GetLangName(uint16_t lang_id);

private:
  struct CharPair {
    wchar_t simplified;
    wchar_t traditional;
  };

  static const std::vector<CharPair> &GetCharPairs();
  static const std::array<uint8_t, 65536> &GetCharClassTable();
  static bool NextCodePoint(const std::wstring &text, size_t &index,
                            char32_t &code_point);
  static bool CanMapToWchar(char32_t cp);
  static bool IsHiragana(char32_t ch);
  static bool IsKatakana(char32_t ch);
  static bool IsHangul(char32_t ch);
  static bool IsExtendedCJK(char32_t ch);
  static bool IsCJK(char32_t ch);
};

#endif

LanguageDetectorW.cpp

#include "LanguageDetectorW.h"

const std::vector<LanguageDetectorW::CharPair> &
LanguageDetectorW::GetCharPairs() {
  static const std::vector<CharPair> pairs = {
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'线', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'广', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'访', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'亿', L''}, {L'', L''}, {L'', L''},
      {L'', L''}, {L'', L''}, {L'', L''},
  };
  return pairs;
}

const std::array<uint8_t, 65536> &LanguageDetectorW::GetCharClassTable() {
  static const std::array<uint8_t, 65536> table = []() {
    std::array<uint8_t, 65536> t{};
    for (const auto &pair : GetCharPairs()) {
      t[static_cast<uint16_t>(pair.simplified)] |= 0x01;
      t[static_cast<uint16_t>(pair.traditional)] |= 0x02;
    }
    return t;
  }();
  return table;
}

bool LanguageDetectorW::NextCodePoint(const std::wstring &text, size_t &index,
                                      char32_t &code_point) {
#if defined(_WIN32) || defined(_WIN64)
  wchar_t lead = text[index++];
  if (lead >= 0xD800 && lead <= 0xDBFF) {
    if (index < text.size()) {
      wchar_t trail = text[index];
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        ++index;
        code_point = 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10) +
                     (static_cast<char32_t>(trail) - 0xDC00);
        return true;
      }
    }
    return false;
  }
  if (lead >= 0xDC00 && lead <= 0xDFFF) {
    return false;
  }
  code_point = static_cast<char32_t>(lead);
  return true;
#else
  code_point = static_cast<char32_t>(text[index++]);
  return true;
#endif
}

bool LanguageDetectorW::CanMapToWchar(char32_t cp) {
#if defined(_WIN32) || defined(_WIN64)
  return cp <= 0xFFFF;
#else
  return true;
#endif
}

bool LanguageDetectorW::IsHiragana(char32_t ch) {
  return (ch >= 0x3040 && ch <= 0x309F);
}

bool LanguageDetectorW::IsKatakana(char32_t ch) {
  return (ch >= 0x30A0 && ch <= 0x30FF);
}

bool LanguageDetectorW::IsHangul(char32_t ch) {
  return (ch >= 0xAC00 && ch <= 0xD7AF);
}

bool LanguageDetectorW::IsExtendedCJK(char32_t ch) {
  return (ch >= 0x3400 && ch <= 0x4DBF) ||   // Extension A
         (ch >= 0x20000 && ch <= 0x2A6DF) || // Extension B
         (ch >= 0x2A700 && ch <= 0x2B73F) || // Extension C
         (ch >= 0x2B740 && ch <= 0x2B81F) || // Extension D
         (ch >= 0x2B820 && ch <= 0x2CEAF) || // Extension E
         (ch >= 0x2CEB0 && ch <= 0x2EBEF) || // Extension F
         (ch >= 0x30000 && ch <= 0x3134F) || // Extension G
         (ch >= 0x31350 && ch <= 0x323AF) || // Extension H
         (ch >= 0x2EBF0 && ch <= 0x2EE5F) || // Extension I
         (ch >= 0x323B0 && ch <= 0x3347F) || // Extension J
         (ch >= 0xF900 && ch <= 0xFAFF) ||   // Compatibility Ideographs
         (ch >= 0x2F800 &&
          ch <= 0x2FA1F); // Compatibility Ideographs Supplement
}

bool LanguageDetectorW::IsCJK(char32_t ch) {
  return (ch >= 0x4E00 && ch <= 0x9FFF) || IsExtendedCJK(ch);
}

uint16_t LanguageDetectorW::DetectLangId(const std::wstring &text) {
  static const auto &char_table = GetCharClassTable();

  if (text.empty()) {
    return DETECT_LANG_ENGLISH_US;
  }

  uint32_t simp = 0, trad = 0, kana = 0, hangul = 0, cjk = 0;

  for (size_t i = 0; i < text.size();) {
    char32_t ch = 0;
    if (!NextCodePoint(text, i, ch)) {
      continue;
    }
    if (IsHiragana(ch) || IsKatakana(ch)) {
      ++kana;
    } else if (IsHangul(ch)) {
      ++hangul;
    } else if (IsCJK(ch)) {
      ++cjk;
      if (CanMapToWchar(ch)) {
        const uint8_t cls = char_table[static_cast<uint16_t>(ch)];
        if (cls & 0x01) {
          ++simp;
        } else if (cls & 0x02) {
          ++trad;
        }
      }
    }
  }

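  // Any kana is treated as decisive for Japanese, any Hangul for Korean.
  if (kana > 0) {
    return DETECT_LANG_JAPANESE;
  }
  if (hangul > 0) {
    return DETECT_LANG_KOREAN;
  }

  // With at least two script-specific characters, the majority wins
  // (ties go to Simplified).
  int total = static_cast<int>(simp + trad);
  if (total >= 2) {
    return (simp >= trad) ? DETECT_LANG_CHINESE_SIMPLIFIED
                          : DETECT_LANG_CHINESE_TRADITIONAL;
  }
  // A single script-specific character is trusted only when the text holds
  // at least three CJK characters overall.
  if (total == 1 && cjk >= 3) {
    return (simp > 0) ? DETECT_LANG_CHINESE_SIMPLIFIED
                      : DETECT_LANG_CHINESE_TRADITIONAL;
  }
  // Otherwise any CJK character defaults to Simplified Chinese, else English.
  return (cjk > 0) ? DETECT_LANG_CHINESE_SIMPLIFIED : DETECT_LANG_ENGLISH_US;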
}

const wchar_t *LanguageDetectorW::GetLangName(uint16_t lang_id) {
  switch (lang_id) {
  case DETECT_LANG_CHINESE_SIMPLIFIED:
    return L"简体中文";
  case DETECT_LANG_CHINESE_TRADITIONAL:
    return L"繁體中文";
  case DETECT_LANG_JAPANESE:
    return L"日本語";
  case DETECT_LANG_KOREAN:
    return L"한국어";
  case DETECT_LANG_ENGLISH_US:
    return L"English";
  default:
    return L"Unknown";
  }
}

langid.cpp

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

#include "LanguageDetectorW.h"

// ----------------------------------------------------------------------------
// Helper: convert a wstring to a UTF-8 string
std::string WStringToUTF8(const std::wstring &wstr) {
  if (wstr.empty())
    return std::string();

#ifdef _WIN32
  int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(),
                                        (int)wstr.size(), NULL, 0, NULL, NULL);
  if (size_needed <= 0)
    return std::string();
  std::string result(size_needed, 0);
  WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.size(), &result[0],
                      size_needed, NULL, NULL);
  return result;
#else
  // Lossy fallback: narrows each wchar_t to char, so it is only correct for
  // ASCII input.
  return std::string(wstr.begin(), wstr.end());
#endif
}

bool NextCodePointForDisplay(const std::wstring &text, size_t &index,
                             char32_t &code_point) {
#if defined(_WIN32) || defined(_WIN64)
  wchar_t lead = text[index++];
  if (lead >= 0xD800 && lead <= 0xDBFF) {
    if (index < text.size()) {
      wchar_t trail = text[index];
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        ++index;
        code_point = 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10) +
                     (static_cast<char32_t>(trail) - 0xDC00);
        return true;
      }
    }
    code_point = 0xFFFD;
    return true;
  }
  if (lead >= 0xDC00 && lead <= 0xDFFF) {
    code_point = 0xFFFD;
    return true;
  }
  code_point = static_cast<char32_t>(lead);
  return true;
#else
  code_point = static_cast<char32_t>(text[index++]);
  return true;
#endif
}

bool IsCombiningMark(char32_t u) {
  return (u >= 0x0300 && u <= 0x036F) || (u >= 0x1AB0 && u <= 0x1AFF) ||
         (u >= 0x1DC0 && u <= 0x1DFF) || (u >= 0x20D0 && u <= 0x20FF) ||
         (u >= 0xFE20 && u <= 0xFE2F);
}

bool IsWideOrFullwidth(char32_t u) {
  if ((u >= 0x1100 && u <= 0x115F) || (u >= 0x2329 && u <= 0x232A) ||
      (u >= 0x2E80 && u <= 0xA4CF) || (u >= 0xAC00 && u <= 0xD7A3) ||
      (u >= 0xF900 && u <= 0xFAFF) || (u >= 0xFE10 && u <= 0xFE19) ||
      (u >= 0xFE30 && u <= 0xFE6F) || (u >= 0xFF00 && u <= 0xFF60) ||
      (u >= 0xFFE0 && u <= 0xFFE6)) {
    return true;
  }
  if ((u >= 0x20000 && u <= 0x2FFFD) || (u >= 0x30000 && u <= 0x3FFFD) ||
      (u >= 0x2F800 && u <= 0x2FA1F)) {
    return true;
  }
  if ((u >= 0x1F300 && u <= 0x1F64F) || (u >= 0x1F680 && u <= 0x1F6FF) ||
      (u >= 0x1F900 && u <= 0x1F9FF) || (u >= 0x1FA70 && u <= 0x1FAFF) ||
      (u >= 0x1F1E6 && u <= 0x1F1FF)) {
    return true;
  }
  return false;
}

bool IsAmbiguousEaw(char32_t u) {
  if ((u >= 0x0391 && u <= 0x03A1) || (u >= 0x03A3 && u <= 0x03A9) ||
      (u >= 0x03B1 && u <= 0x03C1) || (u >= 0x03C3 && u <= 0x03C9)) {
    return true;
  }
  if (u == 0x0401 || (u >= 0x0410 && u <= 0x044F) || u == 0x0451) {
    return true;
  }
  if ((u >= 0x2010 && u <= 0x2016) || (u >= 0x2018 && u <= 0x2019) ||
      (u >= 0x201C && u <= 0x201D) || (u >= 0x2020 && u <= 0x2022) ||
      (u >= 0x2024 && u <= 0x2027) || u == 0x2030 ||
      (u >= 0x2032 && u <= 0x2033) || u == 0x2035 || u == 0x203B ||
      u == 0x203E) {
    return true;
  }
  if (u == 0x20AC) {
    return true;
  }
  if ((u >= 0x2190 && u <= 0x2199) || u == 0x21D2 || u == 0x21D4) {
    return true;
  }
  if ((u >= 0x2460 && u <= 0x24E9) || (u >= 0x2500 && u <= 0x257F) ||
      (u >= 0x2580 && u <= 0x259F) || (u >= 0x25A0 && u <= 0x25FF) ||
      (u >= 0x2600 && u <= 0x267F) || (u >= 0x2680 && u <= 0x26FF) ||
      (u >= 0x2700 && u <= 0x27BF) || (u >= 0x2B50 && u <= 0x2B59)) {
    return true;
  }
  return false;
}

int CodePointDisplayWidth(char32_t cp, bool ambiguous_is_wide = true) {
  if (IsCombiningMark(cp)) {
    return 0;
  }
  if (IsWideOrFullwidth(cp) || (ambiguous_is_wide && IsAmbiguousEaw(cp))) {
    return 2;
  }
  return 1;
}

int EawDisplayWidthUTF8(const std::string &s, bool ambiguous_is_wide = true) {
  int width = 0;
  const size_t len = s.size();
  size_t i = 0;
  while (i < len) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    char32_t cp = 0;
    size_t char_len = 0;

    if (c < 0x80) {
      cp = c;
      char_len = 1;
    } else if (c >= 0xC0 && c < 0xE0) {
      if (i + 1 >= len) {
        ++width;
        ++i;
        continue;
      }
      const unsigned char c2 = static_cast<unsigned char>(s[i + 1]);
      cp = ((c & 0x1F) << 6) | (c2 & 0x3F);
      char_len = 2;
    } else if (c >= 0xE0 && c < 0xF0) {
      if (i + 2 >= len) {
        ++width;
        ++i;
        continue;
      }
      const unsigned char c2 = static_cast<unsigned char>(s[i + 1]);
      const unsigned char c3 = static_cast<unsigned char>(s[i + 2]);
      cp = ((c & 0x0F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);
      char_len = 3;
    } else if (c >= 0xF0 && c < 0xF8) {
      if (i + 3 >= len) {
        ++width;
        ++i;
        continue;
      }
      const unsigned char c2 = static_cast<unsigned char>(s[i + 1]);
      const unsigned char c3 = static_cast<unsigned char>(s[i + 2]);
      const unsigned char c4 = static_cast<unsigned char>(s[i + 3]);
      cp = ((c & 0x07) << 18) | ((c2 & 0x3F) << 12) | ((c3 & 0x3F) << 6) |
           (c4 & 0x3F);
      char_len = 4;
    } else {
      ++width;
      ++i;
      continue;
    }

    i += char_len;
    width += CodePointDisplayWidth(cp, ambiguous_is_wide);
  }
  return width;
}

int DisplayWidth(const std::wstring &text) {
  return EawDisplayWidthUTF8(WStringToUTF8(text), true);
}

std::wstring GetTextPreviewW(const std::wstring &text, int max_width = 30) {
  if (max_width <= 3) {
    return L"...";
  }
  if (DisplayWidth(text) <= max_width) {
    return text;
  }

  int width = 0;
  std::wstring out;
  for (size_t i = 0; i < text.size();) {
    size_t old_i = i;
    char32_t cp = 0;
    NextCodePointForDisplay(text, i, cp);
    int cw = CodePointDisplayWidth(cp);
    if (width + cw > max_width - 3) {
      out += L"...";
      return out;
    }
    out.append(text, old_i, i - old_i);
    width += cw;
  }
  return out;
}

std::string WCharPtrToUTF8(const wchar_t *wstr) {
  if (wstr == nullptr) {
    return std::string();
  }
  return WStringToUTF8(std::wstring(wstr));
}

std::string FitOrPadCellUTF8(const std::string &text, int width,
                             bool align_left = true) {
  int w = EawDisplayWidthUTF8(text);
  if (w >= width) {
    return text;
  }
  int pad = width - w;
  if (align_left) {
    return text + std::string(pad, ' ');
  }
  return std::string(pad, ' ') + text;
}

std::string FitOrPadCellW(const std::wstring &text, int width,
                          bool align_left = true) {
  return FitOrPadCellUTF8(WStringToUTF8(text), width, align_left);
}

std::string CenterCellASCII(const std::string &text, int width) {
  int w = static_cast<int>(text.size());
  if (w >= width) {
    return text;
  }
  int total = width - w;
  int left = total / 2;
  int right = total - left;
  return std::string(left, ' ') + text + std::string(right, ' ');
}

// Format a LANGID as 0xNNNN
std::string FormatLangId(uint16_t lang_id) {
  std::ostringstream oss;
  oss << "0x" << std::hex << std::uppercase << std::setw(4) << std::setfill('0')
      << lang_id;
  return oss.str();
}

// ----------------------------------------------------------------------------

int main() {
#ifdef _WIN32
  SetConsoleOutputCP(CP_UTF8);
#endif

  // Test cases
  struct TestCase {
    const wchar_t *name;
    std::wstring text;
    uint16_t expected;
  };

  std::vector<TestCase> tests = {
      {L"简体中文", L"你好,这是一个测试。", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"繁体中文", L"你好,這是一個測試。", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"简体短文", L"我们一起去图书馆看书。", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"繁体短文", L"我們一起去圖書館看書。", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"日文", L"こんにちは世界、これはテストです。", DETECT_LANG_JAPANESE},
      {L"韩文", L"안녕하세요 세계, 이것은 테스트입니다.", DETECT_LANG_KOREAN},
      {L"英文", L"Hello World, this is a test.", DETECT_LANG_ENGLISH_US},
      {L"短中文", L"你好", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"中英混合", L"Hello 世界", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"繁体短句", L"這本書很有趣", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"简体短句", L"这本书很有趣", DETECT_LANG_CHINESE_SIMPLIFIED},
  };

  // Warm-up: trigger static table initialization so the first timed test
  // does not include cold-start overhead
  for (const auto &test : tests) {
    (void)LanguageDetectorW::DetectLangId(test.text);
  }

  // Print the table header
  std::cout << "\n";
  constexpr int COL_TYPE = 14;
  constexpr int COL_PREVIEW = 32;
  constexpr int COL_RESULT = 10;
  constexpr int COL_EXPECTED = 10;
  constexpr int COL_STATUS = 8;
  constexpr int COL_TIME = 13;

  std::cout << "+--------------+--------------------------------+----------+---"
               "-------+--------+-------------+\n";
  std::cout << "|" << FitOrPadCellW(L"类型", COL_TYPE) << "|"
            << FitOrPadCellW(L"文本预览", COL_PREVIEW) << "|"
            << FitOrPadCellW(L"结果", COL_RESULT) << "|"
            << FitOrPadCellW(L"预期", COL_EXPECTED) << "|"
            << FitOrPadCellW(L"状态", COL_STATUS) << "|"
            << FitOrPadCellW(L"耗时(ns)", COL_TIME) << "|\n";
  std::cout << "+--------------+--------------------------------+----------+---"
               "-------+--------+-------------+\n";

  for (const auto &test : tests) {
    auto start = std::chrono::high_resolution_clock::now();
    uint16_t result = LanguageDetectorW::DetectLangId(test.text);
    auto end = std::chrono::high_resolution_clock::now();
    auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
                   .count();

    bool passed = (result == test.expected);

    // Formatted output
    // Text preview (truncated by display width)
    std::wstring preview_w = GetTextPreviewW(test.text, 30);
    std::cout << "|" << FitOrPadCellW(std::wstring(test.name), COL_TYPE) << "|"
              << FitOrPadCellW(preview_w, COL_PREVIEW) << "|"
              << FitOrPadCellUTF8(FormatLangId(result), COL_RESULT, false)
              << "|"
              << FitOrPadCellUTF8(FormatLangId(test.expected), COL_EXPECTED,
                                  false)
              << "|" << CenterCellASCII(passed ? "PASS" : "FAIL", COL_STATUS)
              << "|" << FitOrPadCellUTF8(std::to_string(dur), COL_TIME, false)
              << "|\n";
  }

  std::cout << "+--------------+--------------------------------+----------+---"
               "-------+--------+-------------+\n";

  // Edge-case tests
  std::cout << "\n";
  std::cout
      << "+-------------------------------------------------------------+\n";
  std::cout
      << "| 边界测试                                                    |\n";
  std::cout
      << "+-------------------------------------------------------------+\n";

  struct EdgeTest {
    std::wstring text;
    const wchar_t *desc;
  };

  std::vector<EdgeTest> edge_tests = {
      {L"", L"空字符串"},
      {L"1234567890", L"纯数字"},
      {L"!@#$%^&*()", L"纯符号"},
      {L"abc123汉字", L"混合字符串"},
      {L"𠀀𠀁𠀂", L"CJK扩展B区字符"},
  };

  for (const auto &test : edge_tests) {
    uint16_t result = LanguageDetectorW::DetectLangId(test.text);
    std::wstring line = L"文本: \"" + test.text + L"\" - " +
                        std::wstring(test.desc) + L" -> " +
                        std::wstring(LanguageDetectorW::GetLangName(result));
    std::cout << "|" << FitOrPadCellW(line, 61) << "|\n";
  }

  std::cout
      << "+-------------------------------------------------------------+\n";

  // Performance test
  std::cout << "\n";
  std::cout
      << "+-------------------------------------------------------------+\n";
  std::cout
      << "| 性能测试 (10000次调用)                                      |\n";
  std::cout
      << "+-------------------------------------------------------------+\n";

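  // 1000 filler characters followed by a Simplified Chinese sentence.
  // The filler character is assumed to be 这; the literal was empty as pasted.
  std::wstring long_text =
      std::wstring(1000, L'这') + L"是一个比较长的测试文本,包含简体中文汉字。";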

  (void)LanguageDetectorW::DetectLangId(long_text);
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 10000; i++) {
    LanguageDetectorW::DetectLangId(long_text);
  }
  auto end = std::chrono::high_resolution_clock::now();
  auto duration_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  auto duration_ms = duration_ns / 1000000;
  auto avg_ns = duration_ns / 10000;

  // Build each row as a wide string and let FitOrPadCellW pad it to the
  // 61-column box by display width, as the edge-case table above does.
  // The hand-counted padding only lined up for specific digit counts.
  const std::wstring perf_rows[] = {
      L" 文本长度: " + std::to_wstring(long_text.size()) + L" 字符",
      L" 总耗时:   " + std::to_wstring(duration_ms) + L" ms",
      L" 平均:     " + std::to_wstring(avg_ns) + L" ns/次",
  };
  for (const auto &row : perf_rows) {
    std::cout << "|" << FitOrPadCellW(row, 61) << "|\n";
  }

  std::cout
      << "+-------------------------------------------------------------+\n";

  // Accuracy test
  std::cout << "\n";
  std::cout
      << "+-------------------------------------------------------------+\n";
  std::cout
      << "| 准确率测试                                                  |\n";
  std::cout
      << "+-------------------------------------------------------------+\n";

  struct AccuracyTest {
    std::wstring text;
    uint16_t expected;
  };

  std::vector<AccuracyTest> accuracy_tests = {
      {L"中华人民共和国", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"中華人民共和國", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"计算机科学", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"計算機科學", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"软件开发工程师", DETECT_LANG_CHINESE_SIMPLIFIED},
      {L"軟體開發工程師", DETECT_LANG_CHINESE_TRADITIONAL},
      {L"わたしは学生です", DETECT_LANG_JAPANESE},
      {L"私は学生です", DETECT_LANG_JAPANESE},
      {L"안녕하세요", DETECT_LANG_KOREAN},
      {L"Hello World", DETECT_LANG_ENGLISH_US},
  };

  int correct = 0;
  int failed_count = 0;

  for (const auto &test : accuracy_tests) {
    uint16_t result = LanguageDetectorW::DetectLangId(test.text);
    if (result == test.expected) {
      correct++;
    } else {
      if (failed_count == 0) {
        std::cout << "| 失败的测试:                                            "
                     "    |\n";
        std::cout << "+--------------------------------------------------------"
                     "-----+\n";
      }
      failed_count++;
      std::string text_utf8 = WStringToUTF8(test.text);
      std::string expected_utf8 =
          WCharPtrToUTF8(LanguageDetectorW::GetLangName(test.expected));
      std::string result_utf8 =
          WCharPtrToUTF8(LanguageDetectorW::GetLangName(result));
      std::cout << "|   \"" << text_utf8 << "\"";

      int text_len = 4 + EawDisplayWidthUTF8(text_utf8);
      if (text_len < 30) {
        for (int i = text_len; i < 30; i++)
          std::cout << " ";
      }

      std::cout << "-> 预期: " << expected_utf8;

      int expected_len = 10 + EawDisplayWidthUTF8(expected_utf8);
      if (expected_len < 25) {
        for (int i = expected_len; i < 25; i++)
          std::cout << " ";
      }

      std::cout << "得到: " << result_utf8;

      int result_len = 6 + EawDisplayWidthUTF8(result_utf8);
      for (int i = result_len; i < 55; i++)
        std::cout << " ";

      std::cout << "|\n";
    }
  }

  if (failed_count == 0) {
    std::cout
        << "| 所有测试均通过!                                            |\n";
  }

  std::cout
      << "+-------------------------------------------------------------+\n";
  std::cout << "| 准确率: " << std::right << std::setw(5) << correct << "/"
            << accuracy_tests.size() << " (" << std::fixed
            << std::setprecision(1) << (correct * 100.0 / accuracy_tests.size())
            << "%)";
  int line_len3 = 12 + (int)std::to_string(correct).length() +
                  (int)std::to_string(accuracy_tests.size()).length() + 6;
  for (int i = line_len3; i < 55; i++)
    std::cout << " ";
  std::cout << "|\n";
  std::cout
      << "+-------------------------------------------------------------+\n";

  // LANGID reference table
  std::cout << "\n";
  std::cout << "+------------------------------+----------+--------------------"
               "---------------+\n";
  std::cout << "| 语言                         | LANGID   | 宏定义             "
               "               |\n";
  std::cout << "+------------------------------+----------+--------------------"
               "---------------+\n";
  std::cout << "| 简体中文                     | 0x0804   | "
               "DETECT_LANG_CHINESE_SIMPLIFIED    |\n";
  std::cout << "| 繁体中文                     | 0x0404   | "
               "DETECT_LANG_CHINESE_TRADITIONAL   |\n";
  std::cout << "| 日文                         | 0x0411   | "
               "DETECT_LANG_JAPANESE              |\n";
  std::cout << "| 韩文                         | 0x0412   | "
               "DETECT_LANG_KOREAN                |\n";
  std::cout << "| 英文                         | 0x0409   | "
               "DETECT_LANG_ENGLISH_US            |\n";
  std::cout << "+------------------------------+----------+--------------------"
               "---------------+\n";

  return 0;
}
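
The LANGID values printed in the table above follow the Windows layout in which the low 10 bits hold the primary language and the high 6 bits hold the sublanguage. A minimal sketch of that composition, assuming the standard `winnt.h` values; the `k`-prefixed constants and the helper names are illustrative stand-ins so the snippet builds without Windows headers, and they are not part of the demo:

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the Windows MAKELANGID macro: (sublanguage << 10) | primary.
constexpr std::uint16_t MakeLangId(std::uint16_t primary, std::uint16_t sub) {
  return static_cast<std::uint16_t>((sub << 10) | primary);
}

// Inverse helpers, equivalent to PRIMARYLANGID and SUBLANGID.
constexpr std::uint16_t PrimaryLang(std::uint16_t langid) {
  return static_cast<std::uint16_t>(langid & 0x3FF);
}
constexpr std::uint16_t SubLang(std::uint16_t langid) {
  return static_cast<std::uint16_t>(langid >> 10);
}

// Primary and sublanguage values as defined in winnt.h (names are illustrative).
constexpr std::uint16_t kLangChinese = 0x04;
constexpr std::uint16_t kLangEnglish = 0x09;
constexpr std::uint16_t kLangJapanese = 0x11;
constexpr std::uint16_t kLangKorean = 0x12;
constexpr std::uint16_t kSubDefault = 0x01;
constexpr std::uint16_t kSubChineseTraditional = 0x01;
constexpr std::uint16_t kSubChineseSimplified = 0x02;
constexpr std::uint16_t kSubEnglishUs = 0x01;

// Compile-time checks against the values listed in the table.
static_assert(MakeLangId(kLangChinese, kSubChineseSimplified) == 0x0804, "");
static_assert(MakeLangId(kLangChinese, kSubChineseTraditional) == 0x0404, "");
static_assert(MakeLangId(kLangJapanese, kSubDefault) == 0x0411, "");
static_assert(MakeLangId(kLangKorean, kSubDefault) == 0x0412, "");
static_assert(MakeLangId(kLangEnglish, kSubEnglishUs) == 0x0409, "");

// Runtime round-trip check: splitting a LANGID recovers both halves.
inline void CheckRoundTrip() {
  assert(PrimaryLang(0x0804) == kLangChinese);
  assert(SubLang(0x0804) == kSubChineseSimplified);
}
```

Simplified and Traditional Chinese share the primary language `0x04` and differ only in the sublanguage, which is why the two LANGIDs differ only in the high byte.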

@cqjjjzr
Author

cqjjjzr commented May 30, 2026

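> A rough idea: switch the langid automatically based on the candidate. The test demo output is below; performance should be good enough, and there should be room to extend this to other languages.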

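Chinese and Japanese share a large number of words that are written identically. For example, I type Chinese with Wubi, and Japanese is also entered mostly word by word, so the misdetection rate would be very high. More importantly, in software such as MS Word, text could be misclassified even if I type only Chinese the whole time, and when typing Japanese, kanji words could end up in a Chinese font while the kana parts get a Japanese font. I therefore think language detection on the input side is completely unworkable.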
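
The ambiguity described above can be illustrated with a minimal sketch. This is not the demo's `LanguageDetectorW`, whose internals are not shown here; `ScriptGuess` and `GuessByScript` are hypothetical names, and the classifier looks only at Unicode blocks. Under that assumption, a commit that consists solely of shared ideographs carries no signal that separates Chinese from Japanese:

```cpp
#include <cassert>
#include <string>

enum class ScriptGuess { kNone, kJapanese, kKorean, kHanAmbiguous };

// Classifies text purely by Unicode block. Kana or Hangul settles the
// answer immediately; ideographs alone leave the language undecided.
inline ScriptGuess GuessByScript(const std::u32string &text) {
  bool has_han = false;
  for (char32_t c : text) {
    if (c >= 0x3040 && c <= 0x30FF) {
      return ScriptGuess::kJapanese;  // Hiragana and Katakana blocks
    }
    if (c >= 0xAC00 && c <= 0xD7A3) {
      return ScriptGuess::kKorean;  // Hangul syllables
    }
    if (c >= 0x4E00 && c <= 0x9FFF) {
      has_han = true;  // CJK Unified Ideographs, shared by Chinese and Japanese
    }
  }
  return has_han ? ScriptGuess::kHanAmbiguous : ScriptGuess::kNone;
}
```

With this sketch, `私は学生です` is classified as Japanese because it contains kana, while the word `学生` committed on its own is ambiguous: it is spelled with the same code points in Simplified Chinese and in Japanese. Word-by-word input produces many such commits, which is the case the comment above is concerned with.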
