Fix Bug 66263 — Add support for reading SDT row in tables#971
Fix Bug 66263 — Add support for reading SDT row in tables#971hostedbygnome wants to merge 7 commits intoapache:trunkfrom
Conversation
|
|
would it be possible to provide a test? |
yes, sorry for the delay |
|
Is there anything else required in pr? |
There was a problem hiding this comment.
Pull request overview
Adds support for handling Word SDT (content control) structures that appear at the same level as table rows and runs, so their inner content can be discovered during parsing and text extraction (Bug 66263).
Changes:
- Update
XWPFTableto iterate table child elements and recursively processCTSdtRowto extract innerCTRowcontent. - Update
XWPFParagraphto additionally processCTRelements insideCTSdtBlock/CTSdtRuncontent. - Add regression tests and new sample
.docxfiles for SDT rows in tables and SDT runs in paragraphs.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
poi-ooxml/src/main/java/org/apache/poi/xwpf/usermodel/XWPFTable.java |
Cursor-based traversal of table children; recursive SDT-row handling |
poi-ooxml/src/main/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java |
Processes CTRs nested inside SDT content when building run lists |
poi-ooxml/src/test/java/org/apache/poi/xwpf/usermodel/TestXWPFTable.java |
New unit tests for SDT-row table behavior and sample-doc regression |
poi-ooxml/src/test/java/org/apache/poi/xwpf/usermodel/TestXWPFParagraph.java |
New unit tests for SDT run handling + sample-doc regression |
test-data/document/Bug66263-table.docx |
Added sample document covering SDT row in table |
test-data/document/Bug66263-paragraph.docx |
Added sample document covering SDT runs in paragraphs |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try (XmlCursor cursor = table.newCursor()) { | ||
| cursor.selectPath("./*"); | ||
| while (cursor.toNextSelection()) { | ||
| XmlObject xmlObject = cursor.getObject(); | ||
| if (xmlObject instanceof CTRow) { | ||
| processCTRow((CTRow)xmlObject); | ||
| } | ||
| else if (xmlObject instanceof CTSdtRow) { | ||
| List<CTRow> rows = new ArrayList<>(); | ||
| collectCTRowsInnerSdtRow((CTSdtRow)xmlObject, rows); | ||
| for (CTRow row : rows) | ||
| { | ||
| processCTRow(row); | ||
| } | ||
| rowText.append(p.getText()); | ||
| } | ||
| } |
There was a problem hiding this comment.
The constructor now adds CTRow instances extracted from CTSdtRow into tableRows. Many mutation/access methods (e.g., removeRow, insertNewTableRow, addRow, getRow(int), getNumberOfRows) assume tableRows indexes map 1:1 to ctTbl’s top-level w:tr array. With SDT-derived rows included, removeRow(pos) will remove the wrong underlying w:tr (or throw) and getRow(int)/getNumberOfRows() become inconsistent with getRows(). Consider either (a) keeping tableRows aligned to top-level w:tr only and handling SDT rows separately for text extraction, or (b) refactoring row operations to remove/insert via the actual row element’s cursor (or SDT wrapper) rather than ctTbl.removeTr(pos) and updating getRow/getNumberOfRows semantics accordingly.
| List<CTRow> rowsInnerSdtContent = sdtContent.getTrList(); | ||
| if (!rowsInnerSdtContent.isEmpty()) { | ||
| rows.addAll(rowsInnerSdtContent); | ||
| return; | ||
| } | ||
|
|
||
| for (CTSdtRow innerSdt : sdtContent.getSdtList()) { | ||
| collectCTRowsInnerSdtRow(innerSdt, rows); |
There was a problem hiding this comment.
collectCTRowsInnerSdtRow returns immediately when sdtContent.getTrList() is non-empty, which skips any nested w:sdt elements that may appear alongside w:tr within the same CTSdtContentRow. This will miss rows in mixed-content SDTs and also loses document order between tr and nested sdt. Iterate through the SDT content children in order (e.g., via an XmlCursor over sdtContent) and collect both CTRow and nested CTSdtRow recursively without the early return.
| List<CTRow> rowsInnerSdtContent = sdtContent.getTrList(); | |
| if (!rowsInnerSdtContent.isEmpty()) { | |
| rows.addAll(rowsInnerSdtContent); | |
| return; | |
| } | |
| for (CTSdtRow innerSdt : sdtContent.getSdtList()) { | |
| collectCTRowsInnerSdtRow(innerSdt, rows); | |
| XmlCursor cursor = sdtContent.newCursor(); | |
| try { | |
| if (!cursor.toFirstChild()) { | |
| return; | |
| } | |
| do { | |
| XmlObject child = cursor.getObject(); | |
| if (child instanceof CTRow) { | |
| rows.add((CTRow) child); | |
| } else if (child instanceof CTSdtRow) { | |
| collectCTRowsInnerSdtRow((CTSdtRow) child, rows); | |
| } | |
| } while (cursor.toNextSibling()); | |
| } finally { | |
| cursor.dispose(); |
| // is an empty table: I add one row and one column as default | ||
| if (initRow && table.sizeOfTrArray() == 0) { | ||
| createEmptyTable(table); | ||
| } |
There was a problem hiding this comment.
The "empty table" check only looks at sizeOfTrArray() == 0. A table that contains only w:sdt row wrappers has no top-level w:tr, but it isn’t actually empty; createEmptyTable(table) will inject an extra blank row when using the default constructor (initRow=true). Consider extending the condition to also check for the presence of SDT rows (e.g., sizeOfSdtArray()==0/getSdtList().isEmpty()), or otherwise detect whether the table already contains any row-like content before creating a default row.
| private void processCTRs(List<CTR> ctrs) { | ||
| if (ctrs == null) { | ||
| return; | ||
| } | ||
| for (CTR ctr : ctrs) { | ||
| if (ctr.getRPr() != null) { |
There was a problem hiding this comment.
processCTRs adds XWPFRun objects for CTRs inside SDT content into the paragraph’s runs list. Many editing APIs (insertNewRun, insertNewHyperlinkRun, etc.) assume every entry in runs corresponds to a CTR whose cursor is in the paragraph (or hyperlink/field) container; for CTRs nested under CTSdtContentRun, insertNewProvidedRun will fail the isCursorInParagraph check and return null for positions that point at these SDT-derived runs. Either avoid adding nested SDT CTRs into runs (keep SDT content accessible via XWPFSDT/iruns), or update the insertion/removal logic and cursor positioning to support runs whose parent is SDT content so that the runs list remains a valid edit surface.
| private void processCTRs(List<CTR> ctrs) { | |
| if (ctrs == null) { | |
| return; | |
| } | |
| for (CTR ctr : ctrs) { | |
| if (ctr.getRPr() != null) { | |
| private boolean isRunInParagraphEditContainer(CTR ctr) { | |
| try (XmlCursor cursor = ctr.newCursor()) { | |
| if (!cursor.toParent()) { | |
| return false; | |
| } | |
| XmlObject parent = cursor.getObject(); | |
| return parent instanceof CTP | |
| || parent instanceof CTHyperlink | |
| || parent instanceof CTSimpleField; | |
| } | |
| } | |
| private void processCTRs(List<CTR> ctrs) { | |
| if (ctrs == null) { | |
| return; | |
| } | |
| for (CTR ctr : ctrs) { | |
| if (ctr.getRPr() != null && isRunInParagraphEditContainer(ctr)) { |
| import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtBlock; | ||
| import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtContentBlock; |
There was a problem hiding this comment.
These imports are unused in this test class (CTSdtBlock, CTSdtContentBlock) and will cause compilation to fail with "unused import" in Java. Remove them or add coverage that actually uses SDT blocks.
| import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtBlock; | |
| import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtContentBlock; |
The issue is that there's an SDT at the same level as the table row.
Added support for this case.
CTR elements inside CTSdtRun and CTRow elements inside CTSdtRow are processed recursively.