feat(docs): support auto-chunking for oversized plain markdown text#964
feat(docs): support auto-chunking for oversized plain markdown text#964jklasdf651 wants to merge 1 commit into
Conversation
|
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR adds Markdown document chunking to safely split large documents into smaller pieces before upload. It detects unsafe Markdown constructs (code blocks, tables, blockquotes, HTML), splits content by UTF-8-aware paragraph boundaries, and integrates chunking into the document create and update endpoints. ChangesMarkdown Chunking for Document Upload
Sequence Diagram(s)sequenceDiagram
participant Client
participant CreateHandler as executeCreateV2
participant Chunker as applyChunkingToBody
participant API as OpenAPI
Client->>CreateHandler: Request with doc-format
CreateHandler->>Chunker: body, "content", doc-format
Chunker->>Chunker: Check format, extract content
alt Chunking succeeds
Chunker->>CreateHandler: Chunked body
CreateHandler->>API: Updated request
API->>Client: Response
else Chunking fails
Chunker->>CreateHandler: Error
CreateHandler->>Client: Error
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
fangshuyu-768
left a comment
There was a problem hiding this comment.
Review Summary
Overall the design is solid — the pass-through for short content, rejection of oversized unsafe content, and UTF-8-aware splitting are well thought out. Tests are comprehensive. Below are issues I found, ordered by severity.
Critical
findSplitPoint—minByteis not guaranteed to be on a rune boundary. While currently safe (only ASCII delimiters are searched), this is fragile and should be documented or fixed.isTableAlignmentRow— false positive on setext-style headings (e.g.---on its own line is an h2, not a table).
Medium
ErrUnsafeMarkdownsays "10,000 characters" but the limit is 10,000 runes.SafeParagraphLimitname doesn't convey the unit is runes.chunkMarkdownForUploadre-joins with"\n\n", altering original whitespace — not truly idempotent.applyChunkingToBodysilently ignores non-string content values.
Minor
- UTF-8 tests check validity but not round-trip content integrity.
|
|
||
| var ErrUnsafeMarkdown = errors.New("oversized markdown contains complex structural elements (fenced code blocks, tables, blockquotes, or HTML) that cannot be safely split. Please manually split the content below 10,000 characters before uploading") | ||
|
|
||
| func isTableAlignmentRow(line string) bool { |
There was a problem hiding this comment.
The name SafeParagraphLimit doesn't convey that the unit is runes (Unicode code points), not bytes or grapheme clusters. Consider renaming to SafeParagraphRuneLimit or adding a comment like // SafeParagraphLimit is the maximum number of Unicode code points (runes) per chunk. to make the unit explicit.
|
|
||
| func isTableAlignmentRow(line string) bool { | ||
| line = strings.TrimSpace(line) | ||
| if line == "" || !strings.Contains(line, "-") { |
There was a problem hiding this comment.
The error message says "below 10,000 characters", but SafeParagraphLimit is actually 10,000 runes (Unicode code points). For CJK text or emoji, a single grapheme cluster may consist of multiple runes, so "characters" can be ambiguous. Suggest changing to "10,000 Unicode code points (runes)" or "10,000 runes" for precision.
| if r != '|' && r != '-' && r != ':' && r != ' ' && r != '\t' { | ||
| return false | ||
| } | ||
| } |
There was a problem hiding this comment.
False positive on setext-style headings.
isTableAlignmentRow("--- ---") returns true because the line only contains -, , and no disallowed characters. However, in Markdown, a line of --- under text is a setext h2 heading, not a table alignment row:
Section Title
---This means any oversized document using setext headings will be incorrectly flagged as containing a table and rejected with ErrUnsafeMarkdown.
Suggestion: require at least one | character in the line for it to qualify as a table alignment row, or check the preceding line for pipe characters. This would also fix the test case {"no pipe at all", "--- ---", true} which should likely be false.
| func chunkMarkdownForUpload(md string) (string, error) { | ||
| if md == "" { | ||
| return "", nil | ||
| } |
There was a problem hiding this comment.
minByte is not guaranteed to be on a rune boundary.
minByte = limitByte * 3 / 4 is a pure byte-level arithmetic calculation. If the string contains multi-byte UTF-8 characters (CJK, emoji, etc.), minByte may land in the middle of a rune.
Currently this is safe in practice because the loop only checks for ASCII bytes (\n and ' '), so it will never slice at minByte directly. However, the invariant is implicit and fragile — if someone later adds a non-ASCII split character check, it would silently produce invalid UTF-8.
Suggestion: either add a comment explicitly documenting this invariant, or align minByte to the next rune boundary:
minByte := limitByte * 3 / 4
// Align to rune boundary
for minByte < len(s) && !utf8.RuneStart(s[minByte]) {
minByte++
}| return "", ErrUnsafeMarkdown | ||
| } | ||
|
|
||
| paragraphs := splitPlainParagraphs(md) |
There was a problem hiding this comment.
When no newline or space is found in the search range, findSplitPoint falls back to limitByte. This is correct because limitByte comes from runeOffset and is guaranteed to be on a rune boundary.
However, the intermediate positions checked in the loops above (i from limitByte-1 down to minByte) are not guaranteed to be on rune boundaries. When findSplitPoint returns i+1 (after finding \n or ), that position happens to be safe because \n and are single-byte ASCII characters, so i+1 is always a valid rune start.
This works today, but the correctness depends on an unstated assumption. Consider adding a comment or a debug assertion:
// Safe because \n and ' ' are single-byte ASCII runes,
// so i+1 is always a valid rune boundary.
return i + 1| return strings.Join(chunks, "\n\n"), nil | ||
| } | ||
|
|
||
| func applyChunkingToBody(body map[string]interface{}, contentKey, docFormat string) error { |
There was a problem hiding this comment.
Re-joining with "\n\n" changes the original whitespace structure.
If the original markdown had 3+ blank lines between paragraphs, or if splitOversizedParagraph splits at a newline causing a trailing \n in a chunk, the re-joined output will have different whitespace than the input. This means chunkMarkdownForUpload is not truly idempotent — calling it twice on the same input could produce different results.
For rendering purposes this is usually fine, but it's worth documenting this behavior. Alternatively, you could track the original separator between paragraphs and restore it on re-join.
| } | ||
| chunked, err := chunkMarkdownForUpload(content) | ||
| if err != nil { | ||
| return output.Errorf(output.ExitAPI, "chunk_error", "%v", err) |
There was a problem hiding this comment.
Silent type assertion ignores non-string content.
content, _ := body[contentKey].(string)If body[contentKey] exists but is not a string (e.g., []byte, json.RawMessage), the type assertion fails silently and content becomes "". The function then returns nil with no error, effectively skipping chunking for that content.
This could lead to hard-to-debug issues where oversized content silently bypasses chunking. Consider either:
- Returning an error when the key exists but isn't a string, or
- Logging a warning, or
- Adding a comment explaining why silent fallback is intentional.
|
|
||
| func TestChunkMarkdownChineseUTF8(t *testing.T) { | ||
| t.Run("Chinese text is split correctly", func(t *testing.T) { | ||
| longChinese := strings.Repeat("你好世界", 3000) |
There was a problem hiding this comment.
Missing round-trip content integrity check.
The TestChunkMarkdownChineseUTF8 and TestChunkMarkdownEmojiSafety tests only verify that the output is valid UTF-8, but they don't verify that the total content is preserved after chunking and re-joining. A bug that drops characters would still pass these tests.
Consider adding an assertion that the concatenation of all chunks (minus the added \n\n separators) equals the original input:
joined := strings.ReplaceAll(result, "\n\n", "")
original := strings.ReplaceAll(longChinese, "\n\n", "")
if joined != original {
t.Error("content was lost during chunking")
}Or more precisely, verify that the total rune count is preserved.
Description
This PR significantly enhances the robust processing of markdown chunking, specifically addressing edge-case bugs and introducing safety guarantees for oversized documents.
Key Changes
Test Plan
Comprehensive unit tests have been added to
markdown_chunk_test.gocovering:Summary by CodeRabbit
Release Notes