docling-project/docling

DOCX chunkers return 0 chunks when document contains tables (TableItem serialization is empty but consumes `visited`)

Open

#3335 opened on April 21, 2026

Labels: bug, docx, good first issue

Description

Bug

Docling successfully converts some .docx files and export_to_text() returns substantial text, but both HierarchicalChunker and HybridChunker return 0 chunks.

This appears to happen when key content is represented as Word tables (often “layout tables”). Debugging suggests a TableItem can serialize to empty text/spans while still populating the shared visited set with a large number of refs. After that, most underlying items are considered “visited”, and no chunkable output is produced.
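
To make the suspected mechanism concrete, here is a toy model in plain Python. It does not use docling at all: `Item`, `serialize`, and `chunk` are hypothetical stand-ins I wrote for illustration, and the real serializer and chunker internals may differ. It only shows how an empty table serialization that still fills a shared `visited` set can leave nothing to chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for a document item (not a docling class)."""
    ref: str
    text: str = ""
    children: list["Item"] = field(default_factory=list)


def serialize(item: Item, visited: set[str]) -> str:
    """Toy serializer: a table claims itself and all descendants, then returns no text."""
    visited.add(item.ref)
    if item.children:  # any item with children plays the role of a table here
        stack = list(item.children)
        while stack:
            child = stack.pop()
            visited.add(child.ref)
            stack.extend(child.children)
        return ""  # the empty serialization described above
    return item.text


def chunk(items: list[Item]) -> list[str]:
    """Toy chunker: walk items in order, skipping visited refs and empty output."""
    visited: set[str] = set()
    chunks: list[str] = []
    for item in items:
        if item.ref in visited:
            continue
        text = serialize(item, visited)
        if text:
            chunks.append(text)
    return chunks


# A layout table wrapping two paragraphs, flattened in document order.
cells = [Item("#/texts/0", "First paragraph"), Item("#/texts/1", "Second paragraph")]
table = Item("#/tables/0", children=cells)
flat = [table, *cells]

print(chunk(flat))   # [] because the table consumed both paragraphs
print(chunk(cells))  # ['First paragraph', 'Second paragraph']
```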

This reproduces in my environment on:

  • docling==2.87.0 (current project env)
  • also reproduced on docling==2.90.0 (latest on PyPI as of 2026-04-21)

Observed on a DOCX that I cannot post publicly for confidentiality reasons, but can share privately or in sanitized form:

  • Conversion succeeds (ConversionStatus.SUCCESS)
  • doc.export_to_text() is non-empty (≈ 9k chars)
  • HierarchicalChunker(...).chunk(doc) returns 0 chunks
  • HybridChunker(...).chunk(doc) returns 0 chunks
  • Serializing the first TableItem yields:
    • table_serialize_text_len == 0
    • table_serialize_spans_len == 0
    • visited_after becomes large (e.g. 58 while total items were ~63)

Expected:

  • Chunkers should emit chunks for the document’s text content.
  • If a TableItem cannot be serialized into chunkable output (empty text/spans), it should not “consume”/mark underlying content as visited in a way that prevents child text from being chunked (or it should fall back to chunking table children).
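
One way the second expectation could look, as a minimal sketch rather than a proposed patch: serialize against a scratch copy of `visited` and commit it only when the result is non-empty. `serialize_guarded`, `SerializeFn`, and the two stub serializers are hypothetical names of mine, and I am assuming a serializer that takes a ref plus the shared set; the real docling-core signatures are different.

```python
from typing import Callable

# Hypothetical signature: takes an item ref and the shared visited set, returns text.
SerializeFn = Callable[[str, set[str]], str]


def serialize_guarded(serialize_fn: SerializeFn, ref: str, visited: set[str]) -> str:
    """Serialize against a scratch copy of visited; commit it only if text came back."""
    scratch = set(visited)
    text = serialize_fn(ref, scratch)
    if text.strip():
        visited.update(scratch)  # normal case: keep every ref the serializer claimed
    # Empty result: visited is left untouched, so child items stay chunkable.
    return text


def empty_table(ref: str, visited: set[str]) -> str:
    """Stub that mimics the reported behavior: claims refs, returns no text."""
    visited.update({ref, "#/texts/0", "#/texts/1"})
    return ""


def good_table(ref: str, visited: set[str]) -> str:
    """Stub for a table that serializes normally."""
    visited.update({ref, "#/texts/2"})
    return "a, b"


visited: set[str] = set()
empty_out = serialize_guarded(empty_table, "#/tables/0", visited)
visited_after_empty = len(visited)
good_out = serialize_guarded(good_table, "#/tables/1", visited)
visited_after_good = len(visited)
print(visited_after_empty, visited_after_good)  # 0 2
```

The copy costs one set duplication per item, which seems acceptable for a guard that only matters in the empty case; falling back to chunking the table's children would be the other option.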

Workaround:

  • Excluding TABLE from serializer labels makes the same DOCX produce chunks:
    • labels = set(MarkdownParams().labels); labels.discard(DocItemLabel.TABLE)
    • HybridChunker(...).chunk(doc, labels=labels) returns >0 chunks
    • Downside: not ideal as a general default since it changes/omits table formatting for table-heavy docs.
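
To limit that downside, the workaround can be applied only when the default call returns nothing. The sketch below is duck-typed and does not import docling: `chunk_with_fallback` and `StubChunker` are hypothetical names, and the only thing assumed about the chunker is a `chunk(doc, labels=...)` call shaped like the one in the workaround above.

```python
from typing import Any, Iterable


def chunk_with_fallback(chunker: Any, doc: Any, fallback_labels: Iterable[Any]) -> list:
    """Chunk with defaults; if that yields nothing, retry with fallback_labels."""
    chunks = list(chunker.chunk(doc))
    if chunks:
        return chunks
    return list(chunker.chunk(doc, labels=set(fallback_labels)))


class StubChunker:
    """Stand-in that mimics the reported behavior; not a docling class."""

    def chunk(self, doc, labels=None):
        # Yields nothing unless a label set without "table" is passed.
        if labels is not None and "table" not in labels:
            yield f"chunk of {doc}"


result = chunk_with_fallback(StubChunker(), "doc", {"text", "list_item"})
print(result)  # ['chunk of doc']
```

With docling, the arguments would be the HybridChunker instance, the converted document, and the label set built as in the workaround (MarkdownParams().labels with DocItemLabel.TABLE discarded). Documents whose tables serialize normally would then keep their table formatting.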

Notes:

  • Docling warns about DrawingML + missing DOCX→PDF converters (LibreOffice). In this case, export_to_text() already contains the relevant text and the issue reproduces even with traverse_pictures=False, so this seems unrelated to the 0-chunk outcome.

Steps to reproduce

  1. Install (this reproduces on both 2.87.0 and 2.90.0; below is 2.87.0):
python -m pip install "docling==2.87.0" tiktoken
# also reproduces with:
# python -m pip install "docling==2.90.0" tiktoken
  2. Use a DOCX that triggers the issue (a table-heavy or layout-table DOCX). I cannot attach the original because of its sensitive content, but I can provide a sanitized sample or share it privately.

  3. Run (replace YOUR_SAMPLE.docx):

import io
from pathlib import Path

import tiktoken
from docling.document_converter import DocumentConverter
from docling_core.types.io import DocumentStream
from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingSerializerProvider,
    HierarchicalChunker,
)
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
from docling_core.transforms.serializer.markdown import MarkdownParams
from docling_core.types.doc.document import TableItem
from docling_core.types.doc.labels import DocItemLabel

p = Path(r"YOUR_SAMPLE.docx")
res = DocumentConverter().convert(
    DocumentStream(name=p.name, stream=io.BytesIO(p.read_bytes()))
)
doc = res.document

print("status", getattr(res, "status", None))
print("export_to_text_len", len(doc.export_to_text()))

h = list(HierarchicalChunker(merge_list_items=True).chunk(doc))
print("hierarchical_chunk_count", len(h))

hy = HybridChunker(
    tokenizer=OpenAITokenizer(
        tokenizer=tiktoken.get_encoding("cl100k_base"),
        max_tokens=512,
    ),
    merge_peers=True,
)
print("hybrid_chunk_count", len(list(hy.chunk(doc))))

# Inspect table serialization
serializer = ChunkingSerializerProvider().get_serializer(doc)
table = next(
    it for it, _lvl in doc.iterate_items(with_groups=True, traverse_pictures=False)
    if isinstance(it, TableItem)
)
visited = set()
ser_res = serializer.serialize(item=table, visited=visited)
print("table_serialize_text_len", len(ser_res.text))
print("table_serialize_spans_len", len(ser_res.spans))
print("visited_after", len(visited))

# Workaround experiment: exclude TABLE
labels = set(MarkdownParams().labels)
labels.discard(DocItemLabel.TABLE)
print("hybrid_chunk_count_no_table", len(list(hy.chunk(doc, labels=labels))))
  4. Observe:
  • export_to_text_len > 0
  • hierarchical_chunk_count == 0
  • hybrid_chunk_count == 0
  • table_serialize_text_len == 0, table_serialize_spans_len == 0, and visited_after grows large
  • hybrid_chunk_count_no_table > 0
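
For triage on other files, the conditions above can be folded into a single check. This is a small sketch: `matches_zero_chunk_signature` is a hypothetical helper, the dictionary keys are the labels printed by the script, and the sample numbers are illustrative placeholders rather than measured output. It leaves out `visited_after` because I have no firm threshold for "large".

```python
def matches_zero_chunk_signature(m: dict[str, int]) -> bool:
    """Return True when the printed metrics match the pattern in this report."""
    return (
        m["export_to_text_len"] > 0
        and m["hierarchical_chunk_count"] == 0
        and m["hybrid_chunk_count"] == 0
        and m["table_serialize_text_len"] == 0
        and m["table_serialize_spans_len"] == 0
        and m["hybrid_chunk_count_no_table"] > 0
    )


# Illustrative placeholder values, not measured output.
sample = {
    "export_to_text_len": 9000,
    "hierarchical_chunk_count": 0,
    "hybrid_chunk_count": 0,
    "table_serialize_text_len": 0,
    "table_serialize_spans_len": 0,
    "hybrid_chunk_count_no_table": 5,
}
healthy = {**sample, "hierarchical_chunk_count": 7, "hybrid_chunk_count": 7}

print(matches_zero_chunk_signature(sample))   # True
print(matches_zero_chunk_signature(healthy))  # False
```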

Docling version

Project environment (docling --version):

Docling version: 2.87.0
Docling Core version: 2.73.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.8.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0

Also reproduced with (docling==2.90.0):

Docling version: 2.90.0
Docling Core version: 2.74.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.9.0
Python: cpython-313 (3.13.11)
Platform: Windows-11-10.0.26200-SP0

Python version

Python 3.13.11
