docling-project/docling

Extract user comments from Office documents (.docx, .pptx, .xlsx) and embed them in the output formats (.md, .html)

Open

#2,119 opened on August 22, 2025

Labels: enhancement, good first issue, pptx, xlsx

Description

Requested feature

Add functionality to extract user comments and discussion threads from Office documents (.docx, .pptx, .xlsx) and embed them into the output formats (.md, .html).

Feature Description

When users collaborate on Office documents, they often add comments to review content, ask questions, or have discussions. This commentary is valuable context that is directly related to specific passages in the document.

Currently, docling processes the main content but ignores these embedded comments. By discarding this information, the output loses important context and discussion.

Proposed Solution

  • Parse Comments: Identify and extract all comments and their corresponding reply threads from the source .docx, .pptx, and .xlsx files.
  • Associate Content: Link each comment thread to the specific text or element it refers to in the document.
  • Embed in Output: Intelligently embed the extracted comments into the final .md or .html output. For example, they could be formatted as footnotes, side notes, or blockquotes adjacent to the relevant content.
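
To make the first two steps concrete for .docx, here is a minimal sketch that uses only the Python standard library. It is not docling's API: the function name `extract_docx_comments`, the returned dictionary layout, and the `make_demo_docx` helper are assumptions made for illustration. It relies on the WordprocessingML layout, where comment bodies are stored in `word/comments.xml` and the commented span is delimited in `word/document.xml` by `w:commentRangeStart` / `w:commentRangeEnd` elements carrying the same `w:id`. Reply threading is not handled here (as far as I know, newer Word versions record it in a separate `word/commentsExtended.xml` part), and .pptx and .xlsx store comments in different parts that would need their own parsers.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/comments.xml and word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def extract_docx_comments(docx_file):
    """Return {comment_id: {"author", "date", "text", "passage"}} for a .docx.

    `docx_file` is a path or a binary file-like object. Hypothetical helper,
    not part of docling.
    """
    with zipfile.ZipFile(docx_file) as zf:
        if "word/comments.xml" not in zf.namelist():
            return {}  # the document has no comments part
        comments_root = ET.fromstring(zf.read("word/comments.xml"))
        document_root = ET.fromstring(zf.read("word/document.xml"))

    comments = {}
    for node in comments_root.iter(f"{W}comment"):
        comments[node.get(f"{W}id")] = {
            "author": node.get(f"{W}author"),
            "date": node.get(f"{W}date"),
            "text": "".join(t.text or "" for t in node.iter(f"{W}t")),
            "passage": "",
        }

    # Walk the body in document order and collect the text that falls between
    # each commentRangeStart / commentRangeEnd pair.
    open_ids = set()
    for node in document_root.iter():
        if node.tag == f"{W}commentRangeStart":
            open_ids.add(node.get(f"{W}id"))
        elif node.tag == f"{W}commentRangeEnd":
            open_ids.discard(node.get(f"{W}id"))
        elif node.tag == f"{W}t":
            for cid in open_ids:
                if cid in comments:
                    comments[cid]["passage"] += node.text or ""
    return comments


def make_demo_docx():
    """Build a stripped-down in-memory archive with just the two parts read
    above. It is demo data only, not a file Word would open."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        "<w:r><w:t>Many developers choose </w:t></w:r>"
        '<w:commentRangeStart w:id="0"/>'
        "<w:r><w:t>Python</w:t></w:r>"
        '<w:commentRangeEnd w:id="0"/>'
        "<w:r><w:t> because of its simplicity.</w:t></w:r>"
        "</w:p></w:body></w:document>"
    )
    comments = (
        f'<w:comments xmlns:w="{W_NS}">'
        '<w:comment w:id="0" w:author="Omkar Musale" w:date="2025-07-21T11:01:52Z">'
        "<w:p><w:r><w:t>Why Python specifically over other languages?</w:t></w:r></w:p>"
        "</w:comment></w:comments>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document)
        zf.writestr("word/comments.xml", comments)
    buf.seek(0)
    return buf


for comment_id, info in extract_docx_comments(make_demo_docx()).items():
    print(comment_id, info["passage"], "|", info["author"], "|", info["text"])
```

Keeping the passage text next to each comment is what the "Associate Content" step needs; a real integration would more likely attach the comment to the corresponding element of the converted document rather than to a raw string.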

Example input document:

Output:

Python Programming Language Overview

Python is a powerful programming language used for data analysis and web development. Many developers choose {{COMMENT_START:thread_id=0}}Python{{COMMENT_END:thread_id=0}} because of its simplicity and readability. The Python community is very active and supportive.

{{COMMENT_START:thread_id=1}}Python offers excellent libraries for machine learning{{COMMENT_END:thread_id=1}} and artificial intelligence. Companies like Google and Netflix use Python extensively in their operations. Learning Python can open many career opportunities in technology.


Comments

  • Thread_id : 0: Passage: "Python"

    • Comment 1 by Omkar Musale (2025-07-21 11:01:52 UTC) Why Python specifically over other languages?
  • Thread_id : 1: Passage: "Python offers excellent libraries for machine learning"

    • Comment 1 by Omkar Musale (2025-07-21 11:02:11 UTC) Which libraries are you referring to?

    • Reply 2 by Omkar Musale (2025-07-21 11:02:19 UTC) Are these open source libraries?

    • Reply 3 by Omkar Musale (2025-07-21 11:02:25 UTC) not sure
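
The output format above could be produced along these lines. This is only a sketch of the "Embed in Output" step: the function names, the `threads` structure (character offsets plus a list of `(author, timestamp, body)` tuples), and the exact marker spelling are assumptions for illustration, not an existing docling interface. It also assumes the commented spans do not overlap.

```python
def embed_comment_markers(text, threads):
    """Wrap each commented span of `text` in COMMENT_START/COMMENT_END markers.

    Each thread is a dict with hypothetical keys "thread_id", "start", "end"
    (character offsets into `text`) and "comments".
    """
    out = text
    # Insert from the end of the text backwards so earlier offsets stay valid.
    for thread in sorted(threads, key=lambda t: t["start"], reverse=True):
        tid = thread["thread_id"]
        start, end = thread["start"], thread["end"]
        out = (
            out[:start]
            + f"{{{{COMMENT_START:thread_id={tid}}}}}"
            + out[start:end]
            + f"{{{{COMMENT_END:thread_id={tid}}}}}"
            + out[end:]
        )
    return out


def render_comment_section(text, threads):
    """Render the trailing "Comments" list, one entry per thread."""
    lines = ["Comments", ""]
    for thread in sorted(threads, key=lambda t: t["thread_id"]):
        passage = text[thread["start"]:thread["end"]]
        lines.append(f'  • Thread_id : {thread["thread_id"]}: Passage: "{passage}"')
        # The first entry of a thread is the comment, later ones are replies.
        for n, (author, when, body) in enumerate(thread["comments"], start=1):
            kind = "Comment" if n == 1 else "Reply"
            lines.append(f"    • {kind} {n} by {author} ({when}) {body}")
    return "\n".join(lines)


text = "Many developers choose Python because of its simplicity and readability."
start = text.index("Python")
threads = [
    {
        "thread_id": 0,
        "start": start,
        "end": start + len("Python"),
        "comments": [
            ("Omkar Musale", "2025-07-21 11:01:52 UTC",
             "Why Python specifically over other languages?"),
        ],
    }
]
print(embed_comment_markers(text, threads))
print()
print(render_comment_section(text, threads))
```

Inline markers plus a trailing list keep the body text readable in both .md and .html; footnotes or blockquotes, as suggested above, would only require changing the two rendering functions.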

