feat(docs): support auto-chunking for oversized plain markdown text by jklasdf651 · Pull Request #964 · larksuite/cli

jklasdf651 · 2026-05-19T07:32:07Z

Description
This PR significantly enhances the robust processing of markdown chunking, specifically addressing edge-case bugs and introducing safety guarantees for oversized documents.

Key Changes

Table Detection Optimization: Switched to alignment-row analysis as the definitive marker, supporting both standard markdown tables with/without leading/trailing pipes.
HTML Scan Loop: Upgraded from a single-token check to a continuous loop scan, preventing potential boundaries-crossing omissions.
Idempotency Guarantee: Introduced short-document pass-through logic (≤10k chars) to eliminate any processing risks for standard payloads.
Error Message Optimization: Refined error text to provide user-friendly, actionable suggestions for manually splitting complex/oversized structural elements.

Test Plan
Comprehensive unit tests have been added to markdown_chunk_test.go covering:

Idempotency: Short content with code blocks, tables, and blockquotes passes through unchanged.
Rejection Boundaries: Document sizes (>10k chars) combined with complex structural elements are correctly intercepted.
Edge-case Markers: Validated alignment rows alone, consecutive pipe variations, less-than comparison operators vs. HTML tags to avoid false positives/negatives.

Summary by CodeRabbit

Release Notes

New Features
- Documents with large Markdown content are now automatically chunked during creation and updates
- Safety validation detects and rejects Markdown containing unsafe structural patterns (code blocks, blockquotes, HTML tags)
- Proper UTF-8 support ensures emoji and international characters are handled correctly during content chunking

CLAassistant · 2026-05-19T07:32:14Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

coderabbitai · 2026-05-19T07:32:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 361f40db-dc37-40a7-adbb-efc0b8da8397

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa61db and 03555e4.

📒 Files selected for processing (4)

shortcuts/doc/docs_create_v2.go
shortcuts/doc/docs_update_v2.go
shortcuts/doc/markdown_chunk.go
shortcuts/doc/markdown_chunk_test.go

📝 Walkthrough

Walkthrough

This PR adds Markdown document chunking to safely split large documents into smaller pieces before upload. It detects unsafe Markdown constructs (code blocks, tables, blockquotes, HTML), splits content by UTF-8-aware paragraph boundaries, and integrates chunking into the document create and update endpoints.

Changes

Markdown Chunking for Document Upload

Layer / File(s)	Summary
Markdown Chunking Core Logic `shortcuts/doc/markdown_chunk.go`	Defines `SafeParagraphLimit` constant and `ErrUnsafeMarkdown` error. Implements `containsUnsafeMarkdown` to scan for unsafe patterns (fenced code blocks, blockquotes, HTML tags, Markdown tables). Splits content by paragraphs via `splitPlainParagraphs`. Chunks oversized paragraphs using UTF-8 rune counting and boundary selection (newline preferred, then space, then byte offset). Main routine `chunkMarkdownForUpload` conditionally processes content. Helper `applyChunkingToBody` integrates chunking into request body maps, returning API-style errors on failure.
Chunking Test Suite `shortcuts/doc/markdown_chunk_test.go`	Comprehensive test coverage: unsafe Markdown detection for code blocks, tables, blockquotes, HTML/callout tags; table alignment-row variants; idempotent behavior for short/safe content; rejection of oversized unsafe content; paragraph splitting with list handling; oversized paragraph splitting respecting rune limits; UTF-8 correctness for Chinese text and emoji; multi-paragraph preservation; internal helper correctness for rune offsets and split-point selection; integration behavior across format types.
Document Create and Update Integration `shortcuts/doc/docs_create_v2.go`, `shortcuts/doc/docs_update_v2.go`	Both endpoints now call `applyChunkingToBody` to preprocess the request body (chunking the `content` field based on `doc-format`). Chunking errors are returned early; successful chunking proceeds to the API call with the updated body.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CreateHandler as executeCreateV2
  participant Chunker as applyChunkingToBody
  participant API as OpenAPI
  Client->>CreateHandler: Request with doc-format
  CreateHandler->>Chunker: body, "content", doc-format
  Chunker->>Chunker: Check format, extract content
  alt Chunking succeeds
    Chunker->>CreateHandler: Chunked body
    CreateHandler->>API: Updated request
    API->>Client: Response
  else Chunking fails
    Chunker->>CreateHandler: Error
    CreateHandler->>Client: Error
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

size/L, domain/ccm

Suggested reviewers

fangshuyu-768

Poem

📄 A chunky tale of Markdown dreams,
Where long docs split at careful seams,
No fences dare, no tables break—
Safe paragraphs the chunker makes! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding auto-chunking support for oversized plain markdown text, which aligns with the primary purpose of this PR.
Description check	✅ Passed	The description covers the main changes (table detection, HTML scanning, idempotency, error messages) and includes a comprehensive test plan, but does not follow the provided template structure with explicit Summary, Changes, and Test Plan sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

fangshuyu-768

Review Summary

Overall the design is solid — the pass-through for short content, rejection of oversized unsafe content, and UTF-8-aware splitting are well thought out. Tests are comprehensive. Below are issues I found, ordered by severity.

Critical

findSplitPoint — minByte is not guaranteed to be on a rune boundary. While currently safe (only ASCII delimiters are searched), this is fragile and should be documented or fixed.
isTableAlignmentRow — false positive on setext-style headings (e.g. --- on its own line is an h2, not a table).

Medium

ErrUnsafeMarkdown says "10,000 characters" but the limit is 10,000 runes.
SafeParagraphLimit name doesn't convey the unit is runes.
chunkMarkdownForUpload re-joins with "\n\n", altering original whitespace — not truly idempotent.
applyChunkingToBody silently ignores non-string content values.

Minor

UTF-8 tests check validity but not round-trip content integrity.

fangshuyu-768 · 2026-05-20T02:36:53Z

+
+var ErrUnsafeMarkdown = errors.New("oversized markdown contains complex structural elements (fenced code blocks, tables, blockquotes, or HTML) that cannot be safely split. Please manually split the content below 10,000 characters before uploading")
+
+func isTableAlignmentRow(line string) bool {


The name SafeParagraphLimit doesn't convey that the unit is runes (Unicode code points), not bytes or grapheme clusters. Consider renaming to SafeParagraphRuneLimit or adding a comment like // SafeParagraphLimit is the maximum number of Unicode code points (runes) per chunk. to make the unit explicit.

fangshuyu-768 · 2026-05-20T02:36:53Z

+
+func isTableAlignmentRow(line string) bool {
+	line = strings.TrimSpace(line)
+	if line == "" || !strings.Contains(line, "-") {


The error message says "below 10,000 characters", but SafeParagraphLimit is actually 10,000 runes (Unicode code points). For CJK text or emoji, a single grapheme cluster may consist of multiple runes, so "characters" can be ambiguous. Suggest changing to "10,000 Unicode code points (runes)" or "10,000 runes" for precision.

fangshuyu-768 · 2026-05-20T02:36:53Z

+		if r != '|' && r != '-' && r != ':' && r != ' ' && r != '\t' {
+			return false
+		}
+	}


False positive on setext-style headings.

isTableAlignmentRow("--- ---") returns true because the line only contains -, , and no disallowed characters. However, in Markdown, a line of --- under text is a setext h2 heading, not a table alignment row:

Section Title ---

This means any oversized document using setext headings will be incorrectly flagged as containing a table and rejected with ErrUnsafeMarkdown.

Suggestion: require at least one | character in the line for it to qualify as a table alignment row, or check the preceding line for pipe characters. This would also fix the test case {"no pipe at all", "--- ---", true} which should likely be false.

fangshuyu-768 · 2026-05-20T02:36:53Z

+func chunkMarkdownForUpload(md string) (string, error) {
+	if md == "" {
+		return "", nil
+	}


minByte is not guaranteed to be on a rune boundary.

minByte = limitByte * 3 / 4 is a pure byte-level arithmetic calculation. If the string contains multi-byte UTF-8 characters (CJK, emoji, etc.), minByte may land in the middle of a rune.

Currently this is safe in practice because the loop only checks for ASCII bytes (\n and ' '), so it will never slice at minByte directly. However, the invariant is implicit and fragile — if someone later adds a non-ASCII split character check, it would silently produce invalid UTF-8.

Suggestion: either add a comment explicitly documenting this invariant, or align minByte to the next rune boundary:

minByte := limitByte * 3 / 4 // Align to rune boundary for minByte < len(s) && !utf8.RuneStart(s[minByte]) { minByte++ }

fangshuyu-768 · 2026-05-20T02:36:53Z

+		return "", ErrUnsafeMarkdown
+	}
+
+	paragraphs := splitPlainParagraphs(md)


When no newline or space is found in the search range, findSplitPoint falls back to limitByte. This is correct because limitByte comes from runeOffset and is guaranteed to be on a rune boundary.

However, the intermediate positions checked in the loops above (i from limitByte-1 down to minByte) are not guaranteed to be on rune boundaries. When findSplitPoint returns i+1 (after finding \n or ), that position happens to be safe because \n and are single-byte ASCII characters, so i+1 is always a valid rune start.

This works today, but the correctness depends on an unstated assumption. Consider adding a comment or a debug assertion:

// Safe because \n and ' ' are single-byte ASCII runes, // so i+1 is always a valid rune boundary. return i + 1

fangshuyu-768 · 2026-05-20T02:36:54Z

+	return strings.Join(chunks, "\n\n"), nil
+}
+
+func applyChunkingToBody(body map[string]interface{}, contentKey, docFormat string) error {


Re-joining with "\n\n" changes the original whitespace structure.

If the original markdown had 3+ blank lines between paragraphs, or if splitOversizedParagraph splits at a newline causing a trailing \n in a chunk, the re-joined output will have different whitespace than the input. This means chunkMarkdownForUpload is not truly idempotent — calling it twice on the same input could produce different results.

For rendering purposes this is usually fine, but it's worth documenting this behavior. Alternatively, you could track the original separator between paragraphs and restore it on re-join.

fangshuyu-768 · 2026-05-20T02:36:54Z

+	}
+	chunked, err := chunkMarkdownForUpload(content)
+	if err != nil {
+		return output.Errorf(output.ExitAPI, "chunk_error", "%v", err)


Silent type assertion ignores non-string content.

content, _ := body[contentKey].(string)

If body[contentKey] exists but is not a string (e.g., []byte, json.RawMessage), the type assertion fails silently and content becomes "". The function then returns nil with no error, effectively skipping chunking for that content.

This could lead to hard-to-debug issues where oversized content silently bypasses chunking. Consider either:

Returning an error when the key exists but isn't a string, or

Logging a warning, or

Adding a comment explaining why silent fallback is intentional.

fangshuyu-768 · 2026-05-20T02:36:54Z

+
+func TestChunkMarkdownChineseUTF8(t *testing.T) {
+	t.Run("Chinese text is split correctly", func(t *testing.T) {
+		longChinese := strings.Repeat("你好世界", 3000)


Missing round-trip content integrity check.

The TestChunkMarkdownChineseUTF8 and TestChunkMarkdownEmojiSafety tests only verify that the output is valid UTF-8, but they don't verify that the total content is preserved after chunking and re-joining. A bug that drops characters would still pass these tests.

Consider adding an assertion that the concatenation of all chunks (minus the added \n\n separators) equals the original input:

joined := strings.ReplaceAll(result, "\n\n", "") original := strings.ReplaceAll(longChinese, "\n\n", "") if joined != original { t.Error("content was lost during chunking") }

Or more precisely, verify that the total rune count is preserved.

feat(docs): support auto-chunking for oversized plain markdown text

03555e4

github-actions Bot added domain/ccm PR touches the ccm domain size/M Single-domain feat or fix with limited business impact labels May 19, 2026

fangshuyu-768 requested changes May 20, 2026

View reviewed changes


		var ErrUnsafeMarkdown = errors.New("oversized markdown contains complex structural elements (fenced code blocks, tables, blockquotes, or HTML) that cannot be safely split. Please manually split the content below 10,000 characters before uploading")

		func isTableAlignmentRow(line string) bool {

Conversation

jklasdf651 commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

CLAassistant commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

fangshuyu-768 left a comment

Choose a reason for hiding this comment

Review Summary

Critical

Medium

Minor

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jklasdf651 commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading