Skip to content

feat(docs): support auto-chunking for oversized plain markdown text#964

Open
jklasdf651 wants to merge 1 commit into
larksuite:mainfrom
jklasdf651:feat-chunk
Open

feat(docs): support auto-chunking for oversized plain markdown text#964
jklasdf651 wants to merge 1 commit into
larksuite:mainfrom
jklasdf651:feat-chunk

Conversation

@jklasdf651
Copy link
Copy Markdown

@jklasdf651 jklasdf651 commented May 19, 2026

Description
This PR significantly enhances the robust processing of markdown chunking, specifically addressing edge-case bugs and introducing safety guarantees for oversized documents.

Key Changes

  1. Table Detection Optimization: Switched to alignment-row analysis as the definitive marker, supporting both standard markdown tables with/without leading/trailing pipes.
  2. HTML Scan Loop: Upgraded from a single-token check to a continuous loop scan, preventing potential boundaries-crossing omissions.
  3. Idempotency Guarantee: Introduced short-document pass-through logic (≤10k chars) to eliminate any processing risks for standard payloads.
  4. Error Message Optimization: Refined error text to provide user-friendly, actionable suggestions for manually splitting complex/oversized structural elements.

Test Plan
Comprehensive unit tests have been added to markdown_chunk_test.go covering:

  • Idempotency: Short content with code blocks, tables, and blockquotes passes through unchanged.
  • Rejection Boundaries: Document sizes (>10k chars) combined with complex structural elements are correctly intercepted.
  • Edge-case Markers: Validated alignment rows alone, consecutive pipe variations, less-than comparison operators vs. HTML tags to avoid false positives/negatives.

Summary by CodeRabbit

Release Notes

  • New Features
    • Documents with large Markdown content are now automatically chunked during creation and updates
    • Safety validation detects and rejects Markdown containing unsafe structural patterns (code blocks, blockquotes, HTML tags)
    • Proper UTF-8 support ensures emoji and international characters are handled correctly during content chunking

Review Change Stack

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 361f40db-dc37-40a7-adbb-efc0b8da8397

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa61db and 03555e4.

📒 Files selected for processing (4)
  • shortcuts/doc/docs_create_v2.go
  • shortcuts/doc/docs_update_v2.go
  • shortcuts/doc/markdown_chunk.go
  • shortcuts/doc/markdown_chunk_test.go

📝 Walkthrough

Walkthrough

This PR adds Markdown document chunking to safely split large documents into smaller pieces before upload. It detects unsafe Markdown constructs (code blocks, tables, blockquotes, HTML), splits content by UTF-8-aware paragraph boundaries, and integrates chunking into the document create and update endpoints.

Changes

Markdown Chunking for Document Upload

Layer / File(s) Summary
Markdown Chunking Core Logic
shortcuts/doc/markdown_chunk.go
Defines SafeParagraphLimit constant and ErrUnsafeMarkdown error. Implements containsUnsafeMarkdown to scan for unsafe patterns (fenced code blocks, blockquotes, HTML tags, Markdown tables). Splits content by paragraphs via splitPlainParagraphs. Chunks oversized paragraphs using UTF-8 rune counting and boundary selection (newline preferred, then space, then byte offset). Main routine chunkMarkdownForUpload conditionally processes content. Helper applyChunkingToBody integrates chunking into request body maps, returning API-style errors on failure.
Chunking Test Suite
shortcuts/doc/markdown_chunk_test.go
Comprehensive test coverage: unsafe Markdown detection for code blocks, tables, blockquotes, HTML/callout tags; table alignment-row variants; idempotent behavior for short/safe content; rejection of oversized unsafe content; paragraph splitting with list handling; oversized paragraph splitting respecting rune limits; UTF-8 correctness for Chinese text and emoji; multi-paragraph preservation; internal helper correctness for rune offsets and split-point selection; integration behavior across format types.
Document Create and Update Integration
shortcuts/doc/docs_create_v2.go, shortcuts/doc/docs_update_v2.go
Both endpoints now call applyChunkingToBody to preprocess the request body (chunking the content field based on doc-format). Chunking errors are returned early; successful chunking proceeds to the API call with the updated body.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CreateHandler as executeCreateV2
  participant Chunker as applyChunkingToBody
  participant API as OpenAPI
  Client->>CreateHandler: Request with doc-format
  CreateHandler->>Chunker: body, "content", doc-format
  Chunker->>Chunker: Check format, extract content
  alt Chunking succeeds
    Chunker->>CreateHandler: Chunked body
    CreateHandler->>API: Updated request
    API->>Client: Response
  else Chunking fails
    Chunker->>CreateHandler: Error
    CreateHandler->>Client: Error
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

size/L, domain/ccm

Suggested reviewers

  • fangshuyu-768

Poem

📄 A chunky tale of Markdown dreams,
Where long docs split at careful seams,
No fences dare, no tables break—
Safe paragraphs the chunker makes! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding auto-chunking support for oversized plain markdown text, which aligns with the primary purpose of this PR.
Description check ✅ Passed The description covers the main changes (table detection, HTML scanning, idempotency, error messages) and includes a comprehensive test plan, but does not follow the provided template structure with explicit Summary, Changes, and Test Plan sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added domain/ccm PR touches the ccm domain size/M Single-domain feat or fix with limited business impact labels May 19, 2026
Copy link
Copy Markdown
Collaborator

@fangshuyu-768 fangshuyu-768 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Overall the design is solid — the pass-through for short content, rejection of oversized unsafe content, and UTF-8-aware splitting are well thought out. Tests are comprehensive. Below are issues I found, ordered by severity.

Critical

  1. findSplitPointminByte is not guaranteed to be on a rune boundary. While currently safe (only ASCII delimiters are searched), this is fragile and should be documented or fixed.
  2. isTableAlignmentRow — false positive on setext-style headings (e.g. --- on its own line is an h2, not a table).

Medium

  1. ErrUnsafeMarkdown says "10,000 characters" but the limit is 10,000 runes.
  2. SafeParagraphLimit name doesn't convey the unit is runes.
  3. chunkMarkdownForUpload re-joins with "\n\n", altering original whitespace — not truly idempotent.
  4. applyChunkingToBody silently ignores non-string content values.

Minor

  1. UTF-8 tests check validity but not round-trip content integrity.


var ErrUnsafeMarkdown = errors.New("oversized markdown contains complex structural elements (fenced code blocks, tables, blockquotes, or HTML) that cannot be safely split. Please manually split the content below 10,000 characters before uploading")

func isTableAlignmentRow(line string) bool {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name SafeParagraphLimit doesn't convey that the unit is runes (Unicode code points), not bytes or grapheme clusters. Consider renaming to SafeParagraphRuneLimit or adding a comment like // SafeParagraphLimit is the maximum number of Unicode code points (runes) per chunk. to make the unit explicit.


func isTableAlignmentRow(line string) bool {
line = strings.TrimSpace(line)
if line == "" || !strings.Contains(line, "-") {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message says "below 10,000 characters", but SafeParagraphLimit is actually 10,000 runes (Unicode code points). For CJK text or emoji, a single grapheme cluster may consist of multiple runes, so "characters" can be ambiguous. Suggest changing to "10,000 Unicode code points (runes)" or "10,000 runes" for precision.

if r != '|' && r != '-' && r != ':' && r != ' ' && r != '\t' {
return false
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

False positive on setext-style headings.

isTableAlignmentRow("--- ---") returns true because the line only contains -, , and no disallowed characters. However, in Markdown, a line of --- under text is a setext h2 heading, not a table alignment row:

Section Title
---

This means any oversized document using setext headings will be incorrectly flagged as containing a table and rejected with ErrUnsafeMarkdown.

Suggestion: require at least one | character in the line for it to qualify as a table alignment row, or check the preceding line for pipe characters. This would also fix the test case {"no pipe at all", "--- ---", true} which should likely be false.

func chunkMarkdownForUpload(md string) (string, error) {
if md == "" {
return "", nil
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minByte is not guaranteed to be on a rune boundary.

minByte = limitByte * 3 / 4 is a pure byte-level arithmetic calculation. If the string contains multi-byte UTF-8 characters (CJK, emoji, etc.), minByte may land in the middle of a rune.

Currently this is safe in practice because the loop only checks for ASCII bytes (\n and ' '), so it will never slice at minByte directly. However, the invariant is implicit and fragile — if someone later adds a non-ASCII split character check, it would silently produce invalid UTF-8.

Suggestion: either add a comment explicitly documenting this invariant, or align minByte to the next rune boundary:

minByte := limitByte * 3 / 4
// Align to rune boundary
for minByte < len(s) && !utf8.RuneStart(s[minByte]) {
    minByte++
}

return "", ErrUnsafeMarkdown
}

paragraphs := splitPlainParagraphs(md)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When no newline or space is found in the search range, findSplitPoint falls back to limitByte. This is correct because limitByte comes from runeOffset and is guaranteed to be on a rune boundary.

However, the intermediate positions checked in the loops above (i from limitByte-1 down to minByte) are not guaranteed to be on rune boundaries. When findSplitPoint returns i+1 (after finding \n or ), that position happens to be safe because \n and are single-byte ASCII characters, so i+1 is always a valid rune start.

This works today, but the correctness depends on an unstated assumption. Consider adding a comment or a debug assertion:

// Safe because \n and ' ' are single-byte ASCII runes,
// so i+1 is always a valid rune boundary.
return i + 1

return strings.Join(chunks, "\n\n"), nil
}

func applyChunkingToBody(body map[string]interface{}, contentKey, docFormat string) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-joining with "\n\n" changes the original whitespace structure.

If the original markdown had 3+ blank lines between paragraphs, or if splitOversizedParagraph splits at a newline causing a trailing \n in a chunk, the re-joined output will have different whitespace than the input. This means chunkMarkdownForUpload is not truly idempotent — calling it twice on the same input could produce different results.

For rendering purposes this is usually fine, but it's worth documenting this behavior. Alternatively, you could track the original separator between paragraphs and restore it on re-join.

}
chunked, err := chunkMarkdownForUpload(content)
if err != nil {
return output.Errorf(output.ExitAPI, "chunk_error", "%v", err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent type assertion ignores non-string content.

content, _ := body[contentKey].(string)

If body[contentKey] exists but is not a string (e.g., []byte, json.RawMessage), the type assertion fails silently and content becomes "". The function then returns nil with no error, effectively skipping chunking for that content.

This could lead to hard-to-debug issues where oversized content silently bypasses chunking. Consider either:

  1. Returning an error when the key exists but isn't a string, or
  2. Logging a warning, or
  3. Adding a comment explaining why silent fallback is intentional.


func TestChunkMarkdownChineseUTF8(t *testing.T) {
t.Run("Chinese text is split correctly", func(t *testing.T) {
longChinese := strings.Repeat("你好世界", 3000)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing round-trip content integrity check.

The TestChunkMarkdownChineseUTF8 and TestChunkMarkdownEmojiSafety tests only verify that the output is valid UTF-8, but they don't verify that the total content is preserved after chunking and re-joining. A bug that drops characters would still pass these tests.

Consider adding an assertion that the concatenation of all chunks (minus the added \n\n separators) equals the original input:

joined := strings.ReplaceAll(result, "\n\n", "")
original := strings.ReplaceAll(longChinese, "\n\n", "")
if joined != original {
    t.Error("content was lost during chunking")
}

Or more precisely, verify that the total rune count is preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain/ccm PR touches the ccm domain size/M Single-domain feat or fix with limited business impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants