Merged
6 changes: 6 additions & 0 deletions safe_string_codec/__init__.py
@@ -0,0 +1,6 @@
from .safe_string_codec import make_string_arguments_safe, revert_arguments_safe_string

__all__ = [
    "make_string_arguments_safe",
    "revert_arguments_safe_string",
]
160 changes: 160 additions & 0 deletions safe_string_codec/design_and_prompt.md
@@ -0,0 +1,160 @@
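# Original task prompt

I need to design a pair of safe-string conversion functions that replace special characters in a string with hex codes, producing a safe string for use in the application, along with a function that converts a safe string back into an ordinary string.

Please help me design this conversion algorithm. I want it to be efficient and thorough, supporting conversion of any string, with particular attention to corner cases.

I would like you to generate Python code and test code, and to record this prompt together with the design rationale, the corner cases considered, and the other detailed reasoning in a document. Place all files in a suitably named folder under the repository root.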

---

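# Design goals

1. **Reversible**: every input string can be restored exactly.
2. **Safe**: consistent with the CSM argument keyword rules, avoiding message-parsing conflicts.
3. **Efficient**: a single linear scan, O(n) time complexity.
4. **Thorough**: covers CSM keyword characters, control characters, Unicode, malformed input, and other edge cases.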

# Encoding algorithm

## 0) Alignment with CSM-Wiki VI names and parameters

- `CSM - Make String Arguments Safe.vi`
  - `Argument String`
  - `Ignore Argument Type(F)`
  - `Safe Argument String`
- `CSM - Revert Arguments-Safe String.vi`
  - `Safe Argument String`
  - `Force Convert (F)`
  - `Origin Argument String`

The Python implementation provides interfaces with the same semantics:

- `make_string_arguments_safe(argument_string, ignore_argument_type=False)`
- `revert_arguments_safe_string(safe_argument_string, force_convert=False)`

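## 1) Keyword sources

According to the keyword notes for `CSM -- Revert Arguments Safe StringVI`, the following key patterns are involved:

- `->`
- `->|`
- `-@`
- `-&`
- `<-`
- `\r`
- `\n`
- `//`
- `>>`
- `>>>`
- `;`
- `,`

The set of characters that take part in keywords is therefore `-|@&<>\r\n/;,`.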
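
As a sanity check on the list above, the character set can be derived from the patterns rather than written by hand. The following is a minimal sketch; the names `CSM_KEYWORD_PATTERNS` and `keyword_characters` are hypothetical and do not appear in the PR.

```python
# Keyword patterns copied from the list above (assumed to be complete).
CSM_KEYWORD_PATTERNS = [
    "->", "->|", "-@", "-&", "<-", "\r", "\n", "//", ">>", ">>>", ";", ",",
]


def keyword_characters(patterns):
    """Return the set of distinct characters used by any pattern."""
    return {ch for pattern in patterns for ch in pattern}


derived = keyword_characters(CSM_KEYWORD_PATTERNS)

# The hand-written set from the text, for comparison.
documented = set("-|@&<>\r\n/;,")

print(derived == documented)
```

Deriving the set this way keeps it in step with the pattern list if new keywords are ever added.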

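## 2) Input layer

The input is a Python `str`, scanned linearly character by character.

## 3) Output character strategy

- If a character belongs to the keyword character set `-|@&<>\r\n/;,`, it is escaped as `%HH` (two hexadecimal digits, uppercase).
- `%` itself is also escaped, as `%25`, so that decoding is unambiguous.
- All other characters are emitted unchanged (including ordinary text and Unicode).
- Escaping is **conservative and per character** (not full-token matching): any character in the set is escaped, regardless of context.

For example:
- `->` -> `%2D%3E`
- `>` -> `%3E` (escaped even when it appears on its own)
- `;` -> `%3B`
- `,` -> `%2C`
- `%` -> `%25`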
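
The per-character rule above can also be expressed as a translation table. This is a sketch of an alternative formulation using `str.translate`, not the loop used in the PR; `ESCAPE_TABLE` and `escape_keyword_chars` are illustrative names, and the type prefix is deliberately left out.

```python
KEYWORD_CHARS = "-|@&<>\r\n/;,"

# Map each escaped character's code point to its %HH form (uppercase hex).
# '%' is included so that the escape prefix itself is always escaped.
ESCAPE_TABLE = {ord(ch): f"%{ord(ch):02X}" for ch in KEYWORD_CHARS + "%"}


def escape_keyword_chars(text: str) -> str:
    """Escape every keyword character and '%' without looking at context."""
    return text.translate(ESCAPE_TABLE)


for sample in ["->", ">", ";", ",", "%"]:
    print(repr(sample), "->", escape_keyword_chars(sample))
```

Characters outside the table, including non-ASCII text, pass through `translate` unchanged.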

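## 4) Decoding strategy

Scan linearly, character by character:
- On `%`, exactly two hexadecimal digits must follow; they are converted to the corresponding character.
- Any character other than `%` is emitted unchanged.
- An incomplete `%` escape or invalid hexadecimal digits raise `ValueError`.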
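
The same strict rules can be sketched with regular expressions. This is an assumption-laden alternative to the index-based loop in the PR: `unescape_strict` is a hypothetical name, prefix handling is omitted, and a single error message stands in for the two distinct messages the PR raises.

```python
import re

# A '%' followed by exactly two hex digits is a valid escape.
_VALID_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
# A '%' that is NOT followed by two hex digits is malformed.
_BROKEN_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def unescape_strict(text: str) -> str:
    """Decode %HH escapes, rejecting incomplete or non-hex sequences."""
    if _BROKEN_ESCAPE.search(text):
        raise ValueError("Malformed safe string: bad escape sequence")
    # re.sub does not rescan replaced text, so "%2525" yields "%25", not "%".
    return _VALID_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_strict("A%2D%3EB"))
```

Because every literal `%` in a well-formed safe string is itself written as `%25`, checking for a broken escape before substituting is enough to reject truncated input.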

## 5) Flowcharts for the two functions

### `make_string_arguments_safe(argument_string, ignore_argument_type=False)`

```mermaid
flowchart TD
A[Start] --> B{Is argument_string a str?}
B -- No --> E1[Raise TypeError] --> Z[End]
B -- Yes --> C{Is ignore_argument_type a bool?}
C -- No --> E2[Raise TypeError] --> Z
C -- Yes --> D[Scan argument_string character by character]
D --> F{Character in -|@&<>\\r\\n/;, or % ?}
F -- Yes --> G[Emit %HH in uppercase hex]
F -- No --> H[Emit character unchanged]
G --> I{More characters?}
H --> I
I -- Yes --> F
I -- No --> J[Join into safe_argument_string]
J --> K{ignore_argument_type ?}
K -- Yes --> L[Return safe_argument_string] --> Z
K -- No --> M[Return <SAFESTR> + safe_argument_string] --> Z
```

### `revert_arguments_safe_string(safe_argument_string, force_convert=False)`

```mermaid
flowchart TD
A[Start] --> B{Is safe_argument_string a str?}
B -- No --> E1[Raise TypeError] --> Z[End]
B -- Yes --> C{Is force_convert a bool?}
C -- No --> E2[Raise TypeError] --> Z
C -- Yes --> D{Starts with <SAFESTR>?}
D -- Yes --> E[Strip the prefix to get encoded_text]
D -- No --> F{force_convert ?}
F -- No --> G[Return safe_argument_string unchanged] --> Z
F -- Yes --> H[encoded_text = safe_argument_string]
E --> I[Scan encoded_text character by character]
H --> I
I --> J{Is the current character % ?}
J -- No --> K[Append character to result unchanged]
J -- Yes --> L{Two characters follow?}
L -- No --> E3[Raise ValueError: incomplete escape] --> Z
L -- Yes --> M{Both are hex digits?}
M -- No --> E4[Raise ValueError: invalid hex] --> Z
M -- Yes --> N[Decode %HH to a character and append to result]
K --> O{More characters?}
N --> O
O -- Yes --> J
O -- No --> P[Return the joined original string] --> Z
```
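
The first branch of the revert flowchart, which decides whether any decoding happens at all, can be isolated as a small helper. This is only a sketch: `select_payload` is a hypothetical name that does not exist in the PR, and returning `None` to mean "return the input unchanged" is an assumption made for illustration.

```python
from typing import Optional

SAFE_PREFIX = "<SAFESTR>"


def select_payload(text: str, force_convert: bool = False) -> Optional[str]:
    """Return the text that should be decoded, or None to pass through as-is."""
    if text.startswith(SAFE_PREFIX):
        # Typed safe string: strip the type prefix and decode the rest.
        return text[len(SAFE_PREFIX):]
    if force_convert:
        # Untyped input, but the caller asked for decoding anyway.
        return text
    # Untyped input and no force flag: leave the string untouched.
    return None


print(select_payload("<SAFESTR>A%2D%3EB"))
print(select_payload("A%2D%3EB"))
print(select_payload("A%2D%3EB", force_convert=True))
```

Keeping this decision separate from the escape scanning makes it clear that an unprefixed string is never validated unless `force_convert` is set.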

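# Why the scheme is unambiguous

- `%` is the only escape prefix, and every escape has a fixed length of 3 (`%HH`).
- `%` itself is encoded as `%25`, so it cannot collide with the original text.
- The decoder can validate the format strictly, avoiding silent corruption.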
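
To see why escaping `%` matters, compare the scheme with a hypothetical variant that forgets to escape it. The sketch below is illustrative only; `escape` and the two character sets are names introduced here, not part of the PR.

```python
KEYWORD_CHARS = set("-|@&<>\r\n/;,")


def escape(text: str, escaped_chars) -> str:
    """Escape each character found in escaped_chars as %HH (uppercase hex)."""
    return "".join(
        f"%{ord(ch):02X}" if ch in escaped_chars else ch for ch in text
    )


naive = KEYWORD_CHARS          # hypothetical variant that leaves '%' alone
full = KEYWORD_CHARS | {"%"}   # the scheme described above

# Without escaping '%', two different inputs produce the same output,
# so no decoder could tell them apart.
print(escape("-", naive) == escape("%2D", naive))

# With '%' escaped, the outputs differ and decoding stays unambiguous.
print(escape("-", full), escape("%2D", full))
```

The literal text `%2D` becomes `%252D`, which decodes back to `%2D` rather than to `-`.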

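# Complexity and efficiency

- Encoding: a single pass over the characters, O(n).
- Decoding: a single pass over the characters, O(n).
- Output is built from a lightweight list of string fragments that is joined once, so memory overhead stays bounded.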

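# Corner cases covered

1. Empty string
2. Plain ASCII text (must not be rewritten)
3. CSM keyword characters and combinations (`->`, `//`, `>>>`, `;`, `,`, and so on)
4. The escape prefix character `%` itself
5. Control characters: `\r`, `\n`, and other non-keyword control characters (such as `\t` and `\x00`)
6. Unicode: Chinese text, emoji
7. Very long mixed strings
8. Full ASCII round trip (0x00-0x7F)
9. Malformed input handling: incomplete escapes, non-hexadecimal digits
10. Argument type errors (non-`str`)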

# Code and test files

- `safe_string_codec/safe_string_codec.py`: core encode/decode implementation
- `safe_string_codec/test_safe_string_codec.py`: comprehensive `unittest` tests
- `safe_string_codec/__init__.py`: public exports
Comment on lines +156 to +160
Copilot AI Apr 21, 2026

PR description says the diff should be documentation-only/minimal doc update, but this change introduces new runtime code and a full unittest suite (codec implementation + exports). Please update the PR description/scope (or split into separate PRs) so reviewers and release notes match what’s actually being merged.

Contributor

@copilot apply changes based on this feedback

Contributor Author

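The PR description scope has been adjusted per this feedback and now matches what is actually being merged (implementation + tests + documentation, not just a minimal documentation update). This change only updates the PR description; no new code changes were added. Current branch commit: 3907f2c. There are no UI changes, so screenshots are not applicable.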

95 changes: 95 additions & 0 deletions safe_string_codec/safe_string_codec.py
@@ -0,0 +1,95 @@
"""Safe string conversion utilities.

This module provides two functions:
- make_string_arguments_safe: CSM - Make String Arguments Safe.vi equivalent.
- revert_arguments_safe_string: CSM - Revert Arguments-Safe String.vi equivalent.

The escaping behavior follows CSM keyword-safe conventions:
- Escape CSM keyword characters using ``%HH`` uppercase hex.
- ``%`` is also escaped to keep decoding unambiguous.
"""

from __future__ import annotations

_HEX_DIGITS = set("0123456789ABCDEFabcdef")
_ESCAPE = "%"
_SAFE_STRING_TYPE = "<SAFESTR>"

# Character-based conservative escaping for CSM keyword safety.
# We escape all characters that appear in documented keyword patterns
# (->, ->|, -@, -&, <-, \r, \n, //, >>, >>>, ;, ,), regardless of context.
_CSM_KEYWORD_CHARS = set("-|@&<>\r\n/;,")
_ESCAPED_CHARS = _CSM_KEYWORD_CHARS | {_ESCAPE}


def make_string_arguments_safe(argument_string: str, ignore_argument_type: bool = False) -> str:
    """CSM - Make String Arguments Safe.vi equivalent.

    Args:
        argument_string: String argument.
        ignore_argument_type: If True, do not prepend ``<SAFESTR>``.

    Returns:
        Safe argument string.
    """
    if not isinstance(argument_string, str):
        raise TypeError("argument_string must be str")
    if not isinstance(ignore_argument_type, bool):
        raise TypeError("ignore_argument_type must be bool")

    encoded_parts: list[str] = []
    for ch in argument_string:
        if ch in _ESCAPED_CHARS:
            encoded_parts.append(f"{_ESCAPE}{ord(ch):02X}")
        else:
            encoded_parts.append(ch)
    safe_argument_string = "".join(encoded_parts)
    if ignore_argument_type:
        return safe_argument_string
    return f"{_SAFE_STRING_TYPE}{safe_argument_string}"


def revert_arguments_safe_string(safe_argument_string: str, force_convert: bool = False) -> str:
    """CSM - Revert Arguments-Safe String.vi equivalent.

    Args:
        safe_argument_string: Safe string argument.
        force_convert: Convert even when argument type is not ``SAFESTR``.

    Returns:
        Original argument string.

    Raises:
        ValueError: If input is malformed.
    """
    if not isinstance(safe_argument_string, str):
        raise TypeError("safe_argument_string must be str")
    if not isinstance(force_convert, bool):
        raise TypeError("force_convert must be bool")

    encoded_text = safe_argument_string
    if safe_argument_string.startswith(_SAFE_STRING_TYPE):
        encoded_text = safe_argument_string[len(_SAFE_STRING_TYPE) :]
    elif not force_convert:
        return safe_argument_string

    result: list[str] = []
    i = 0
    length = len(encoded_text)

    while i < length:
        ch = encoded_text[i]
        if ch == _ESCAPE:
            if i + 2 >= length:
                raise ValueError("Malformed safe string: incomplete escape sequence")
            h1, h2 = encoded_text[i + 1], encoded_text[i + 2]
            if h1 not in _HEX_DIGITS or h2 not in _HEX_DIGITS:
                raise ValueError("Malformed safe string: invalid hex escape")
            result.append(chr(int(h1 + h2, 16)))
            i += 3
            continue

        result.append(ch)
        i += 1

    return "".join(result)
102 changes: 102 additions & 0 deletions safe_string_codec/test_safe_string_codec.py
@@ -0,0 +1,102 @@
import unittest

from safe_string_codec import make_string_arguments_safe, revert_arguments_safe_string


class SafeStringCodecTests(unittest.TestCase):
    def test_package_level_imports_are_available(self):
        self.assertEqual(make_string_arguments_safe.__name__, "make_string_arguments_safe")
        self.assertEqual(
            revert_arguments_safe_string.__name__, "revert_arguments_safe_string"
        )

    def test_empty_string(self):
        self.assertEqual(make_string_arguments_safe("", ignore_argument_type=True), "")
        self.assertEqual(revert_arguments_safe_string("", force_convert=True), "")

    def test_make_string_arguments_safe_default_adds_type_prefix(self):
        safe = make_string_arguments_safe("A->B")
        self.assertTrue(safe.startswith("<SAFESTR>"))
        self.assertEqual(revert_arguments_safe_string(safe), "A->B")

    def test_make_string_arguments_safe_ignore_type(self):
        safe = make_string_arguments_safe("A->B", ignore_argument_type=True)
        self.assertEqual(safe, "A%2D%3EB")

    def test_revert_without_force_and_without_prefix_keeps_input(self):
        self.assertEqual(revert_arguments_safe_string("A%2D%3EB"), "A%2D%3EB")

    def test_revert_force_convert_without_prefix_decodes(self):
        self.assertEqual(
            revert_arguments_safe_string("A%2D%3EB", force_convert=True), "A->B"
        )

    def test_ascii_alphanumeric_and_space_kept(self):
        original = "AbcXYZ019_. hello"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertEqual(safe, original)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_csm_keyword_characters_are_encoded(self):
        original = "->| -@ -& <-\r\n// >> >>> ;,"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertNotIn("->", safe)
        self.assertNotIn(">>", safe)
        self.assertNotEqual(safe, original)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_percent_character_is_always_encoded(self):
        original = "%"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertEqual(safe, "%25")
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_unicode_chinese_and_emoji(self):
        original = "中文😀"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_unicode_with_adjacent_keywords(self):
        original = "前缀->中文😀//后缀"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertNotIn("->", safe)
        self.assertNotIn("//", safe)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_control_and_null_characters(self):
        original = "line1\nline2\t\x00end"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_long_mixed_string(self):
        original = "A" * 1000 + "🚀" + "\x00" + "終"
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        self.assertEqual(revert_arguments_safe_string(safe, force_convert=True), original)

    def test_roundtrip_for_all_ascii_values(self):
        original = "".join(chr(i) for i in range(128))
        safe = make_string_arguments_safe(original, ignore_argument_type=True)
        restored = revert_arguments_safe_string(safe, force_convert=True)
        self.assertEqual(restored, original)

    def test_decode_rejects_incomplete_escape(self):
        with self.assertRaises(ValueError):
            revert_arguments_safe_string("abc%", force_convert=True)

    def test_decode_rejects_bad_hex(self):
        with self.assertRaises(ValueError):
            revert_arguments_safe_string("abc%G1", force_convert=True)

    def test_type_errors(self):
        with self.assertRaises(TypeError):
            make_string_arguments_safe(None)  # type: ignore[arg-type]
        with self.assertRaises(TypeError):
            revert_arguments_safe_string(None)  # type: ignore[arg-type]
        with self.assertRaises(TypeError):
            make_string_arguments_safe("ok", ignore_argument_type=None)  # type: ignore[arg-type]
        with self.assertRaises(TypeError):
            revert_arguments_safe_string("ok", force_convert=None)  # type: ignore[arg-type]


if __name__ == "__main__":
    unittest.main()