docling-project/docling

Extract user comments from Office documents (.docx, .pptx, .xlsx) and embed them in the output formats (.md, .html).

Open

#2119, created on August 22, 2025

1 comment, 0 reactions, 0 assignees. Python (59,751 stars, 4,140 forks)
Labels: enhancement, good first issue, pptx, xlsx

Description

Requested feature

Add functionality to extract user comments and discussion threads from Office documents (.docx, .pptx, .xlsx) and embed them into the output formats (.md, .html).

Feature Description

When users collaborate on Office documents, they often add comments to review content, ask questions, or have discussions. This commentary is valuable context that is directly related to specific passages in the document.

Currently, docling processes the main content but ignores these embedded comments. Discarding them means the output loses important context and the discussion attached to it.

Proposed Solution

  • Parse Comments: Identify and extract all comments and their corresponding reply threads from the source .docx, .pptx, and .xlsx files.
  • Associate Content: Link each comment thread to the specific text or element it refers to in the document.
  • Embed in Output: Intelligently embed the extracted comments into the final .md or .html output. For example, they could be formatted as footnotes, side notes, or blockquotes adjacent to the relevant content.
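The first two steps can be sketched for .docx using only the Python standard library. This is a minimal illustration, not docling's API: the function name `extract_docx_comments` and the returned dictionary keys are hypothetical. It assumes the usual WordprocessingML layout, where comment bodies live in `word/comments.xml` and the commented passage is delimited in `word/document.xml` by `w:commentRangeStart` and `w:commentRangeEnd` elements sharing the comment's `w:id`. Reply threading and the .pptx/.xlsx formats are not covered here.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(name):
    """Return the namespace-qualified tag or attribute name."""
    return f"{{{W_NS}}}{name}"


def extract_docx_comments(source):
    """Hypothetical helper: list comments with the passage each one refers to.

    `source` is a path or a binary file-like object holding a .docx file.
    """
    with zipfile.ZipFile(source) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []  # the document has no comments part
        comments_root = ET.fromstring(archive.read("word/comments.xml"))
        document_root = ET.fromstring(archive.read("word/document.xml"))

    # Walk the body in document order and collect the text that falls
    # between each commentRangeStart / commentRangeEnd pair.
    passages, open_ids = {}, set()
    for node in document_root.iter():
        if node.tag == _w("commentRangeStart"):
            comment_id = node.get(_w("id"))
            open_ids.add(comment_id)
            passages[comment_id] = ""
        elif node.tag == _w("commentRangeEnd"):
            open_ids.discard(node.get(_w("id")))
        elif node.tag == _w("t"):
            for comment_id in open_ids:
                passages[comment_id] += node.text or ""

    results = []
    for node in comments_root.iter(_w("comment")):
        comment_id = node.get(_w("id"))
        results.append({
            "id": comment_id,
            "author": node.get(_w("author")),
            "date": node.get(_w("date")),
            "text": "".join(t.text or "" for t in node.iter(_w("t"))),
            "passage": passages.get(comment_id, ""),
        })
    return results
```

Calling `extract_docx_comments("review.docx")` (file name illustrative) would return one dictionary per comment. Tracking a set of open ids rather than a single id lets overlapping comment ranges each collect their own passage.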

Example: Input Document:

Output:

Python Programming Language Overview

Python is a powerful programming language used for data analysis and web development. Many developers choose {{COMMENT_START:thread_id=0}}Python{{COMMENT_END:thread_id=0}} because of its simplicity and readability. The Python community is very active and supportive.

{{COMMENT_START:thread_id=1}}Python offers excellent libraries for machine learning{{COMMENT_END:thread_id=1}} and artificial intelligence. Companies like Google and Netflix use Python extensively in their operations. Learning Python can open many career opportunities in technology.


Comments

  • Thread_id : 0: Passage: "Python"

    • Comment 1 by Omkar Musale (2025-07-21 11:01:52 UTC) Why Python specifically over other languages?
  • Thread_id : 1: Passage: "Python offers excellent libraries for machine learning"

    • Comment 1 by Omkar Musale (2025-07-21 11:02:11 UTC) Which libraries are you referring to?

    • Reply 2 by Omkar Musale (2025-07-21 11:02:19 UTC) Are these open source libraries?

    • Reply 3 by Omkar Musale (2025-07-21 11:02:25 UTC) not sure
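The output layout above could be produced roughly as follows. This is only a sketch of the proposed format: the `Thread` class, `embed_markers`, and `render_comments` are hypothetical names, character offsets are assumed to be already known for each passage, passages are assumed not to overlap, and the exact marker spacing is a choice made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical container for one comment thread."""
    thread_id: int
    start: int  # offset where the commented passage begins
    end: int    # offset just past the commented passage
    comments: list = field(default_factory=list)  # (author, timestamp, text)


def embed_markers(text, threads):
    """Wrap each thread's passage in COMMENT_START / COMMENT_END markers."""
    # Work from the end of the text so earlier offsets stay valid.
    for t in sorted(threads, key=lambda t: t.start, reverse=True):
        text = (
            text[:t.start]
            + "{{COMMENT_START:thread_id=%d}}" % t.thread_id
            + text[t.start:t.end]
            + "{{COMMENT_END:thread_id=%d}}" % t.thread_id
            + text[t.end:]
        )
    return text


def render_comments(text, threads):
    """Build the trailing "Comments" section for the given threads."""
    lines = ["Comments", ""]
    for t in sorted(threads, key=lambda t: t.thread_id):
        passage = text[t.start:t.end]
        lines.append(f'  • Thread_id : {t.thread_id}: Passage: "{passage}"')
        for n, (author, when, body) in enumerate(t.comments, start=1):
            kind = "Comment" if n == 1 else "Reply"
            lines.append(f"    • {kind} {n} by {author} ({when}) {body}")
    return "\n".join(lines)
```

Keeping the markers inline and the thread bodies in a separate section means the main text stays readable while each thread can still be traced back to its passage by `thread_id`.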

