DOCX chunkers return 0 chunks when document contains tables (TableItem serialization is empty but consumes `visited`)
#3335 opened on Apr 21, 2026
Bug
Docling successfully converts some .docx files and export_to_text() returns substantial text, but both HierarchicalChunker and HybridChunker return 0 chunks.
This appears to happen when key content is represented as Word tables (often “layout tables”). Debugging suggests a TableItem can serialize to empty text/spans while still populating the shared visited set with a large number of refs. After that, most underlying items are considered “visited”, and no chunkable output is produced.
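The suspected mechanism can be illustrated with a toy model. This is only a sketch under assumptions: `Node`, `serialize_table`, and `toy_chunk` are hypothetical names invented for illustration and are not docling APIs. It shows how a table that serializes to empty text, while still marking its descendants as visited, can leave nothing for the chunker to emit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document item; not a docling type."""
    ref: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def serialize_table(table: Node, visited: set[str]) -> str:
    """Toy table serializer: marks every descendant visited, emits no text."""
    stack = [table]
    while stack:
        node = stack.pop()
        visited.add(node.ref)
        stack.extend(node.children)
    return ""  # stands in for the empty text/spans described in this report


def toy_chunk(items: list[Node], visited: set[str]) -> list[str]:
    """Walk items in document order, skipping anything already visited."""
    chunks = []
    for item in items:
        if item.ref in visited:
            continue
        if item.children:
            text = serialize_table(item, visited)
        else:
            visited.add(item.ref)
            text = item.text
        if text:
            chunks.append(text)
    return chunks


cell_a = Node("#/texts/0", "Invoice number: 42")
cell_b = Node("#/texts/1", "Total due: 100 EUR")
table = Node("#/tables/0", children=[cell_a, cell_b])

visited: set[str] = set()
chunks = toy_chunk([table, cell_a, cell_b], visited)
print(chunks, len(visited))  # [] 3
```

In this toy model the cell text exists and would be exported, but the table reaches it first, returns nothing, and the shared set then hides the cells from the rest of the walk.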
This reproduces in my environment on:
- `docling==2.87.0` (current project env)
- also reproduced on `docling==2.90.0` (latest on PyPI as of 2026-04-21)
Observed on a DOCX I can share privately/sanitized (cannot post publicly due to confidentiality):
- Conversion succeeds (`ConversionStatus.SUCCESS`)
- `doc.export_to_text()` is non-empty (≈ 9k chars)
- `HierarchicalChunker(...).chunk(doc)` returns 0 chunks
- `HybridChunker(...).chunk(doc)` returns 0 chunks
- Serializing the first `TableItem` yields:
  - `table_serialize_text_len == 0`
  - `table_serialize_spans_len == 0`
  - `visited_after` becomes large (e.g. 58 while total items were ~63)
Expected:
- Chunkers should emit chunks for the document’s text content.
- If a `TableItem` cannot be serialized into chunkable output (empty text/spans), it should not “consume”/mark underlying content as visited in a way that prevents child text from being chunked (or it should fall back to chunking table children).
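One way to read the expected behavior is as a commit-or-rollback rule on the visited set. The sketch below is an assumption about how such a guard could look, not how docling implements serialization; `serialize_with_rollback` and the callable it takes are hypothetical.

```python
from typing import Callable


def serialize_with_rollback(
    serialize: Callable[[set[str]], str],
    visited: set[str],
) -> str:
    """Commit refs to `visited` only if serialization produced text.

    `serialize` is a hypothetical callable that records the refs it touches
    in the set it is given and returns the serialized text.
    """
    scratch = set(visited)  # work on a copy so the caller's set stays clean
    text = serialize(scratch)
    if text.strip():
        visited.update(scratch)  # non-empty output: keep the refs as visited
    return text
```

With this guard, an item that serializes to empty text leaves `visited` unchanged, so its children remain available for chunking later in the walk.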
Workaround:
- Excluding TABLE from serializer labels makes the same DOCX produce chunks:
  - `labels = set(MarkdownParams().labels); labels.discard(DocItemLabel.TABLE)`
  - `HybridChunker(...).chunk(doc, labels=labels)` returns >0 chunks
- Downside: not ideal as a general default, since it changes or omits table formatting for table-heavy docs.
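To limit the downside, the workaround can be applied only when the default call yields nothing. This is a minimal sketch: `chunk_with_table_fallback` is a hypothetical helper, and it assumes the chunker accepts `labels=` as a keyword the same way it did in the experiment described in this report.

```python
from typing import Any, Iterable


def chunk_with_table_fallback(
    chunker: Any,
    doc: Any,
    labels_without_table: Iterable[Any],
) -> tuple[list[Any], bool]:
    """Chunk normally; retry without TABLE only if no chunks came back.

    Returns the chunks and a flag telling whether the fallback was used.
    """
    chunks = list(chunker.chunk(doc))
    if chunks:
        return chunks, False
    # Assumption: `labels=` is honored as in the workaround experiment.
    retry = list(chunker.chunk(doc, labels=set(labels_without_table)))
    return retry, True


# Intended usage with the objects from the repro script:
#   labels = set(MarkdownParams().labels)
#   labels.discard(DocItemLabel.TABLE)
#   chunks, used_fallback = chunk_with_table_fallback(hy, doc, labels)
```

Documents whose tables serialize normally keep their table formatting, and the flag makes it easy to log which files needed the fallback.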
Notes:
- Docling warns about DrawingML + missing DOCX→PDF converters (LibreOffice). In this case, `export_to_text()` already contains the relevant text and the issue reproduces even with `traverse_pictures=False`, so this seems unrelated to the 0-chunk outcome.
Steps to reproduce
- Install (this reproduces on both 2.87.0 and 2.90.0; below is 2.87.0):
python -m pip install "docling==2.87.0" tiktoken
# also reproduces with:
# python -m pip install "docling==2.90.0" tiktoken
- Use a DOCX that triggers the issue (table/layout-heavy DOCX). Attach it to the issue if possible (ensure no sensitive content), or I can provide a sanitized sample or share privately.
- Run (replace `YOUR_SAMPLE.docx`):
import io
from pathlib import Path
import tiktoken
from docling.document_converter import DocumentConverter
from docling_core.types.io import DocumentStream
from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingSerializerProvider,
    HierarchicalChunker,
)
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.transforms.serializer.markdown import MarkdownParams
from docling_core.types.doc.document import TableItem
from docling_core.types.doc.labels import DocItemLabel
p = Path(r"YOUR_SAMPLE.docx")
res = DocumentConverter().convert(
    DocumentStream(name=p.name, stream=io.BytesIO(p.read_bytes()))
)
doc = res.document
print("status", getattr(res, "status", None))
print("export_to_text_len", len(doc.export_to_text()))
h = list(HierarchicalChunker(merge_list_items=True).chunk(doc))
print("hierarchical_chunk_count", len(h))
hy = HybridChunker(
    tokenizer=OpenAITokenizer(
        tokenizer=tiktoken.get_encoding("cl100k_base"),
        max_tokens=512,
    ),
    merge_peers=True,
)
print("hybrid_chunk_count", len(list(hy.chunk(doc))))
# Inspect table serialization
serializer = ChunkingSerializerProvider().get_serializer(doc)
table = next(
    it for it, _lvl in doc.iterate_items(with_groups=True, traverse_pictures=False)
    if isinstance(it, TableItem)
)
visited = set()
ser_res = serializer.serialize(item=table, visited=visited)
print("table_serialize_text_len", len(ser_res.text))
print("table_serialize_spans_len", len(ser_res.spans))
print("visited_after", len(visited))
# Workaround experiment: exclude TABLE
labels = set(MarkdownParams().labels)
labels.discard(DocItemLabel.TABLE)
print("hybrid_chunk_count_no_table", len(list(hy.chunk(doc, labels=labels))))
- Observe:
  - `export_to_text_len > 0`
  - `hierarchical_chunk_count == 0`
  - `hybrid_chunk_count == 0`
  - `table_serialize_text_len == 0`, `table_serialize_spans_len == 0`, and `visited_after` grows large
  - `hybrid_chunk_count_no_table > 0`
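For triaging several files, the observations above can be folded into a single check. This is a rough sketch: `matches_reported_symptom` is a hypothetical helper, and the `visited_after > 1` threshold (more refs consumed than the table itself) is my assumption rather than a documented rule.

```python
def matches_reported_symptom(
    chunk_count: int,
    table_text_len: int,
    table_spans_len: int,
    visited_after: int,
) -> bool:
    """Return True when the numbers match the pattern described in this report.

    The pattern: no chunks at all, the table serialized to nothing, and the
    serialization still marked more than just the table itself as visited.
    """
    return (
        chunk_count == 0
        and table_text_len == 0
        and table_spans_len == 0
        and visited_after > 1
    )


# Values from the report: 0 chunks, empty table output, 58 refs visited.
print(matches_reported_symptom(0, 0, 0, 58))  # True
```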
Docling version
Project environment (docling --version):
Docling version: 2.87.0
Docling Core version: 2.73.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.8.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0
Also reproduced with (docling==2.90.0):
Docling version: 2.90.0
Docling Core version: 2.74.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.9.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0
Python version
Python 3.13.11