Need Indic scripts experts to review cleanup code · tesseract-ocr/tesseract#1038

Repository metrics

Stars: 74,090
PR merge metrics: avg merge 3d 15h, 14 merged PRs in 30d

Description

If you can help for a particular script, please comment below.

Comments from Ray, copied from https://github.com/tesseract-ocr/tesseract/issues/995 (read the thread for full context):

It would be useful to have experts in any of the following scripts review the new corpus cleanup code and comment on it:

Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer.

The code contains script-specific cleanup rules. I plan to commit new copies of the training data (unicharsets, wordlists, training text, etc.); at that point the data and the cleanup code will match.

E.g., the code determines what makes a valid or invalid sequence of Unicode code points in the script: for instance, is it allowed to have two matras in a row? It gets more difficult when it is unclear which category the additional characters belong to.
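
To make the kind of rule under review concrete, here is a minimal sketch of a sequence check for Devanagari. It is not the Tesseract implementation: the names `CharClass`, `ClassifyDevanagari`, and `IsValidDevanagariSequence` are hypothetical, the code point ranges are a simplified reading of the Unicode chart, and the rule itself (a matra or virama must directly follow a consonant, which rules out two matras in a row) is an assumption of exactly the sort a script expert would need to confirm or correct.

```cpp
#include <cassert>
#include <string>

// Coarse character classes, for illustration only (hypothetical names).
enum class CharClass { kConsonant, kVowel, kMatra, kVirama, kOther };

// Classifies a code point from the Devanagari block (U+0900..U+097F).
// Simplified: nukta, anusvara, and the less common vowel signs all
// fall into kOther here.
CharClass ClassifyDevanagari(char32_t ch) {
  if (ch >= 0x0915 && ch <= 0x0939) return CharClass::kConsonant;  // KA..HA
  if (ch >= 0x0904 && ch <= 0x0914) return CharClass::kVowel;      // independent vowels
  if (ch >= 0x093E && ch <= 0x094C) return CharClass::kMatra;      // dependent vowel signs
  if (ch == 0x094D) return CharClass::kVirama;
  return CharClass::kOther;
}

// Assumed rule: a matra or a virama is only valid directly after a
// consonant. As a consequence, two matras in a row are rejected.
bool IsValidDevanagariSequence(const std::u32string& text) {
  CharClass prev = CharClass::kOther;
  for (char32_t ch : text) {
    const CharClass cur = ClassifyDevanagari(ch);
    const bool needs_consonant =
        cur == CharClass::kMatra || cur == CharClass::kVirama;
    if (needs_consonant && prev != CharClass::kConsonant) return false;
    prev = cur;
  }
  return true;
}

// A few sample inputs exercising the assumed rule.
void CheckExamples() {
  assert(IsValidDevanagariSequence(U"\u0915\u093F"));         // KA + sign I
  assert(!IsValidDevanagariSequence(U"\u0915\u093F\u0940"));  // two matras in a row
}
```

A real validator has to handle many more classes (nukta, vowel modifiers, joiners) and per-script exceptions, which is where the question of how to categorize the additional characters comes in.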

The major new normalization/text cleanup code is in training/validat*. The best help with this would be expertise in the various scripts, as previously discussed.

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.h

Contributor guide

Research direction: Review the validation code for Indic scripts in the training directory. Focus on understanding the script-specific rules for valid and invalid Unicode sequences. Comment on any issues or possible improvements.
Tech stack: cpp
Domain: backend
Issue type: Research
Difficulty: 4
Estimated time: 1-2 days
Activity status: Stale
Clarity: Clear
Prerequisites: C++, Unicode script knowledge
Newbie friendliness: 15
