Small extra blocks with a single letter gets split off of bigger text blocks · tesseract-ocr/tesseract#2634


Labels: help wanted, layout analysis

Description

Environment

Tesseract Version: tesseract 5.0.0-alpha-357-gdc907 leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 Found AVX2 Found AVX Found FMA Found SSE
Commit Number: dc90741f1b8f37e8d1a0c919bb679f455bd39633
Platform: Linux jk-XPS-13 5.0.0-25-generic #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

This page of a book is recognized perfectly except for „die Sache selbst“ (end of the third line), which becomes „die Sache selbst‘ (with a single closing quote). The other single quote ends up in a separate block that contains only the very small character "C" (see the attached image, testcase).

I'm sorry that I could not provide a cropped test image; the problem disappears for smaller regions.

I'm calling tesseract with default parameters: tesseract testcase.png - -l deu

When called with PSM 6 (single uniform block of text) it works, but I don't want to lose the layout information: tesseract testcase.png - -l deu --psm 6

This is of course a minor bug, but maybe it is also easy to fix. It happens about once in a hundred pages. Sometimes footnote numbers get lost the same way. The problem appears at least with tesseract 4.0, 4.1, and master, and in all OEM modes.

Expected Behavior:

Tesseract should not split off single characters into extra regions.

Suggested Fix:

Maybe padding the recognized blocks a bit?
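
To make the padding idea concrete, here is a minimal sketch of what "padding a block" could mean geometrically. It is not Tesseract code: `Box`, `Padded`, `Overlaps`, and `AbsorbedByPadding` are hypothetical names used only for illustration, and the pixel values in the usage note are made up. It only shows that a small mark sitting a few pixels outside a text block starts to overlap the block once the block's outline is grown by a margin.

```cpp
#include <cassert>

// Hypothetical axis-aligned box in image pixels (not Tesseract's own type).
// right and bottom are exclusive coordinates.
struct Box {
  int left, top, right, bottom;
};

// Returns a copy of `b` grown by `pad` pixels on every side.
Box Padded(const Box& b, int pad) {
  assert(pad >= 0);
  return {b.left - pad, b.top - pad, b.right + pad, b.bottom + pad};
}

// True when the two boxes share at least one pixel.
bool Overlaps(const Box& a, const Box& b) {
  return a.left < b.right && b.left < a.right &&
         a.top < b.bottom && b.top < a.bottom;
}

// True when `small` overlaps the padded outline of `block`, which would
// make it a candidate for staying inside that block.
bool AbsorbedByPadding(const Box& block, const Box& small, int pad) {
  return Overlaps(Padded(block, pad), small);
}
```

For example, with a text block `{100, 100, 900, 1200}` and a quote mark at `{905, 160, 915, 172}`, the two boxes do not overlap on their own, but `AbsorbedByPadding` returns true for a padding of 10 pixels and false for a padding of 2. How large a margin would be safe in practice (for instance, relative to the line height) is an open question that this sketch does not answer.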

Contributor guide

Tech stack: cpp
Area: backend
Issue type: bug
Difficulty: 4
Estimated time: 1-2 days
Activity status: stale
Clarity: mostly clear
Prerequisites: C++ proficiency, Tesseract build setup, layout analysis concepts
Beginner-friendliness: 35
Investigation approach: Examine the block segmentation code in Tesseract, focusing on how text blocks are split. The main files to review are 'ccmain/paragraphs.cpp' and 'ccstruct/blobs.cpp'. The issue suggests that small blocks (single characters) are incorrectly separated; consider adding a merging step or increasing padding. No maintainer response yet; consider asking for a reproducible image or creating one.
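
The merging step mentioned above could look roughly like the following sketch. It is a standalone illustration under stated assumptions, not a patch against Tesseract: `Box`, `Area`, `Gap`, `Union`, and `MergeTinyBlocks` are hypothetical names, and the thresholds `min_area` and `max_gap` are placeholders that would need tuning. Blocks below the area threshold are merged into the nearest larger block if the gap between them is small enough; otherwise they are kept as they are.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical axis-aligned box in image pixels (not Tesseract's own type).
struct Box {
  int left, top, right, bottom;
};

int Area(const Box& b) {
  return std::max(0, b.right - b.left) * std::max(0, b.bottom - b.top);
}

// Larger of the horizontal and vertical gaps; 0 if the boxes touch or overlap.
int Gap(const Box& a, const Box& b) {
  int dx = std::max({0, a.left - b.right, b.left - a.right});
  int dy = std::max({0, a.top - b.bottom, b.top - a.bottom});
  return std::max(dx, dy);
}

// Smallest box that contains both inputs.
Box Union(const Box& a, const Box& b) {
  return {std::min(a.left, b.left), std::min(a.top, b.top),
          std::max(a.right, b.right), std::max(a.bottom, b.bottom)};
}

// Merges each block with area < min_area into the nearest larger block when
// the gap is at most max_gap. Tiny blocks with no close neighbor are appended
// unchanged after the larger blocks.
std::vector<Box> MergeTinyBlocks(const std::vector<Box>& blocks,
                                 int min_area, int max_gap) {
  assert(min_area >= 0 && max_gap >= 0);
  std::vector<Box> large, tiny, isolated;
  for (const Box& b : blocks) {
    (Area(b) >= min_area ? large : tiny).push_back(b);
  }
  for (const Box& t : tiny) {
    std::size_t best = large.size();
    int best_gap = std::numeric_limits<int>::max();
    for (std::size_t i = 0; i < large.size(); ++i) {
      int g = Gap(t, large[i]);
      if (g < best_gap) {
        best_gap = g;
        best = i;
      }
    }
    if (best < large.size() && best_gap <= max_gap) {
      large[best] = Union(large[best], t);  // absorb the tiny block
    } else {
      isolated.push_back(t);  // nothing close enough, keep it separate
    }
  }
  large.insert(large.end(), isolated.begin(), isolated.end());
  return large;
}
```

A fixed area threshold is the simplest possible criterion. A real fix would probably have to distinguish stray punctuation and footnote numbers from legitimately small blocks such as page numbers, which this sketch makes no attempt to do.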