tesseract-ocr/tesseract

Need Indic scripts experts to review cleanup code

Open

#1038 opened on Jul 15, 2017

batch import
help wanted

Description

If you can help with a particular script, please comment below.

Comments from Ray, copied from https://github.com/tesseract-ocr/tesseract/issues/995. Read that thread for full context.


It would be useful to have experts in any of the following scripts review the new corpus cleanup code and make comments:

Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer.

There are script-specific cleanup rules in there. I plan to commit new copies of the training data (unicharsets, wordlists, training text, etc.), and at that point the data and the rules will match.

E.g., the code determines what makes a valid or invalid sequence of Unicode code points in the script. For instance, is it allowed to have two matras in a row? It gets more difficult with questions over which category the additional characters belong to.
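To make the kind of rule under review concrete, here is a minimal sketch of a "two matras in a row" check for Devanagari. It is not the code in training/validat*: the names `CharClass`, `ClassifyDevanagari`, and `HasAdjacentMatras` are hypothetical, the code point ranges are a deliberate simplification (U+093E to U+094C treated as matras, U+0915 to U+0939 as consonants), and whether adjacent matras should really be rejected is exactly the sort of question the script experts are being asked.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified character classes for illustration only.
enum class CharClass { kConsonant, kMatra, kVirama, kOther };

// Assumption: only the core Devanagari ranges are classified here.
// Nukta, anusvara, and the extended vowel signs all fall into kOther.
CharClass ClassifyDevanagari(char32_t ch) {
  if (ch >= 0x0915 && ch <= 0x0939) return CharClass::kConsonant;
  if (ch >= 0x093E && ch <= 0x094C) return CharClass::kMatra;
  if (ch == 0x094D) return CharClass::kVirama;
  return CharClass::kOther;
}

// Returns true if any two dependent vowel signs (matras) are adjacent.
bool HasAdjacentMatras(const std::u32string& text) {
  CharClass prev = CharClass::kOther;
  for (char32_t ch : text) {
    const CharClass cur = ClassifyDevanagari(ch);
    if (cur == CharClass::kMatra && prev == CharClass::kMatra) return true;
    prev = cur;
  }
  return false;
}
```

Under these assumptions, `HasAdjacentMatras(U"\u0915\u093E")` (KA followed by the AA sign) returns false, while `HasAdjacentMatras(U"\u0915\u093E\u093F")` (KA followed by the AA and I signs) returns true. The harder part, as noted above, is deciding which class the less common characters belong to.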

Major new normalization/text cleanup code is in training/validat*. The best help with this would be expertise in the various scripts, as previously discussed.


https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.h
