docling-project/docling

DOCX chunkers return 0 chunks when document contains tables (TableItem serialization is empty but consumes `visited`)

Open

#3335 opened on Apr 21, 2026

Labels: bug, docx, good first issue

Description

Bug

Docling successfully converts some .docx files and export_to_text() returns substantial text, but both HierarchicalChunker and HybridChunker return 0 chunks.

This appears to happen when key content is represented as Word tables (often “layout tables”). Debugging suggests a TableItem can serialize to empty text/spans while still populating the shared visited set with a large number of refs. After that, most underlying items are considered “visited”, and no chunkable output is produced.
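
To make the suspected mechanism concrete, here is a toy model of it. This is not docling code: `Item`, `serialize`, and `chunk` are hypothetical stand-ins, and the sketch assumes the table serializer adds its descendants' refs to `visited` even when it renders nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    ref: str
    text: str = ""
    children: list["Item"] = field(default_factory=list)
    is_table: bool = False


def serialize(item: Item, visited: set[str]) -> str:
    """Toy serializer: records refs in `visited` and returns the item's text."""
    if item.ref in visited:
        return ""
    visited.add(item.ref)
    if item.is_table:
        # Assumed failure mode: the table claims all of its children
        # but produces no text of its own.
        for child in item.children:
            visited.add(child.ref)
        return ""
    return item.text


def chunk(items: list[Item]) -> list[str]:
    """Toy chunker: one chunk per item that serializes to non-empty text."""
    visited: set[str] = set()
    chunks = []
    for item in items:
        text = serialize(item, visited)
        if text:
            chunks.append(text)
    return chunks


cells = [Item(ref=f"#/texts/{i}", text=f"paragraph {i}") for i in range(3)]
layout_table = Item(ref="#/tables/0", children=cells, is_table=True)

# Document order: the layout table first, then the text items inside it.
with_table = chunk([layout_table, *cells])
without_table = chunk(cells)
print(len(with_table), len(without_table))
```

In this model the table is visited first, so every text item is already in `visited` by the time the chunker reaches it, which matches the 0-chunk symptom.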

This reproduces in my environment on:

  • docling==2.87.0 (current project env)
  • also reproduced on docling==2.90.0 (latest on PyPI as of 2026-04-21)

Observed on a DOCX I can share privately/sanitized (cannot post publicly due to confidentiality):

  • Conversion succeeds (ConversionStatus.SUCCESS)
  • doc.export_to_text() is non-empty (≈ 9k chars)
  • HierarchicalChunker(...).chunk(doc) returns 0 chunks
  • HybridChunker(...).chunk(doc) returns 0 chunks
  • Serializing the first TableItem yields:
    • table_serialize_text_len == 0
    • table_serialize_spans_len == 0
    • visited_after becomes large (e.g. 58 while total items were ~63)

Expected:

  • Chunkers should emit chunks for the document’s text content.
  • If a TableItem cannot be serialized into chunkable output (empty text/spans), it should not mark its underlying content as visited, since that prevents the child text from being chunked. Alternatively, the chunker should fall back to chunking the table's children.
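
One possible shape for the first option, as a sketch only: serialize against a scratch copy of `visited` and commit the new refs only when text actually comes back. The wrapper name `serialize_without_empty_claims` and the toy serializer are hypothetical, not part of docling or docling-core.

```python
from typing import Callable


def serialize_without_empty_claims(
    serialize: Callable[[str, set[str]], str],
    item_ref: str,
    visited: set[str],
) -> str:
    """Serialize on a scratch copy of `visited`; commit refs only if text came back."""
    scratch = set(visited)
    text = serialize(item_ref, scratch)
    if text:
        visited.update(scratch)
    return text


# Toy data: one table that owns two text items.
CHILDREN = {"#/tables/0": ["#/texts/0", "#/texts/1"]}
TEXTS = {"#/texts/0": "alpha", "#/texts/1": "beta"}


def toy_serialize(ref: str, visited: set[str]) -> str:
    """Stand-in serializer that claims children even when it returns no text."""
    if ref in visited:
        return ""
    visited.add(ref)
    visited.update(CHILDREN.get(ref, []))
    return TEXTS.get(ref, "")


visited: set[str] = set()
order = ["#/tables/0", "#/texts/0", "#/texts/1"]
chunks = []
for ref in order:
    text = serialize_without_empty_claims(toy_serialize, ref, visited)
    if text:
        chunks.append(text)
print(chunks)
```

Here the empty table result leaves `visited` untouched, so the child text items are still chunked afterwards.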

Workaround:

  • Excluding TABLE from serializer labels makes the same DOCX produce chunks:
    • labels = set(MarkdownParams().labels); labels.discard(DocItemLabel.TABLE)
    • HybridChunker(...).chunk(doc, labels=labels) returns >0 chunks
    • Downside: not ideal as a general default since it changes/omits table formatting for table-heavy docs.
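
To limit that downside, the workaround can be applied only when the default call yields nothing. A minimal sketch, assuming a caller-supplied `chunk_fn` that wraps the chunker; `chunk_with_table_fallback` and `fake_chunk` are hypothetical names used for illustration.

```python
from typing import Callable, Iterable, Optional


def chunk_with_table_fallback(
    chunk_fn: Callable[[Optional[set]], Iterable],
    labels_without_table: set,
) -> tuple[list, bool]:
    """Return (chunks, used_fallback); retry without TABLE only on an empty result."""
    chunks = list(chunk_fn(None))
    if chunks:
        return chunks, False
    return list(chunk_fn(labels_without_table)), True


def fake_chunk(labels: Optional[set]) -> list[str]:
    # Stand-in for the affected document: no chunks with default labels,
    # some chunks once TABLE is excluded.
    return [] if labels is None else ["text chunk 1", "text chunk 2"]


chunks, used_fallback = chunk_with_table_fallback(fake_chunk, {"text", "title"})
print(len(chunks), used_fallback)
```

With the repro script's objects, `chunk_fn` could be `lambda labels: hy.chunk(doc) if labels is None else hy.chunk(doc, labels=labels)`, with `labels_without_table` built as in the workaround above. Table-heavy documents that chunk correctly keep their table formatting this way.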

Notes:

  • Docling warns about DrawingML + missing DOCX→PDF converters (LibreOffice). In this case, export_to_text() already contains the relevant text and the issue reproduces even with traverse_pictures=False, so this seems unrelated to the 0-chunk outcome.

Steps to reproduce

  1. Install (this reproduces on both 2.87.0 and 2.90.0; below is 2.87.0):
python -m pip install "docling==2.87.0" tiktoken
# also reproduces with:
# python -m pip install "docling==2.90.0" tiktoken
  2. Use a table- or layout-heavy DOCX that triggers the issue. I cannot post the original publicly, but I can provide a sanitized sample or share it privately.

  3. Run (replace YOUR_SAMPLE.docx):

import io
from pathlib import Path

import tiktoken
from docling.document_converter import DocumentConverter
from docling_core.types.io import DocumentStream
from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingSerializerProvider,
    HierarchicalChunker,
)
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.transforms.serializer.markdown import MarkdownParams
from docling_core.types.doc.document import TableItem
from docling_core.types.doc.labels import DocItemLabel

p = Path(r"YOUR_SAMPLE.docx")
res = DocumentConverter().convert(
    DocumentStream(name=p.name, stream=io.BytesIO(p.read_bytes()))
)
doc = res.document

print("status", getattr(res, "status", None))
print("export_to_text_len", len(doc.export_to_text()))

h = list(HierarchicalChunker(merge_list_items=True).chunk(doc))
print("hierarchical_chunk_count", len(h))

hy = HybridChunker(
    tokenizer=OpenAITokenizer(
        tokenizer=tiktoken.get_encoding("cl100k_base"),
        max_tokens=512,
    ),
    merge_peers=True,
)
print("hybrid_chunk_count", len(list(hy.chunk(doc))))

# Inspect table serialization
serializer = ChunkingSerializerProvider().get_serializer(doc)
table = next(
    it for it, _lvl in doc.iterate_items(with_groups=True, traverse_pictures=False)
    if isinstance(it, TableItem)
)
visited = set()
ser_res = serializer.serialize(item=table, visited=visited)
print("table_serialize_text_len", len(ser_res.text))
print("table_serialize_spans_len", len(ser_res.spans))
print("visited_after", len(visited))

# Workaround experiment: exclude TABLE
labels = set(MarkdownParams().labels)
labels.discard(DocItemLabel.TABLE)
print("hybrid_chunk_count_no_table", len(list(hy.chunk(doc, labels=labels))))
  4. Observe:
  • export_to_text_len > 0
  • hierarchical_chunk_count == 0
  • hybrid_chunk_count == 0
  • table_serialize_text_len == 0, table_serialize_spans_len == 0, and visited_after grows large
  • hybrid_chunk_count_no_table > 0

Docling version

Project environment (docling --version):

Docling version: 2.87.0
Docling Core version: 2.73.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.8.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0

Also reproduced with (docling==2.90.0):

Docling version: 2.90.0
Docling Core version: 2.74.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.9.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0

Python version

Python 3.13.11
