Need Indic scripts experts to review cleanup code
#1038 opened on Jul 15, 2017
Description
If you can help for a particular script, please comment below.
Comments from Ray, copied from https://github.com/tesseract-ocr/tesseract/issues/995. Read the thread for full context.
It would be useful to have experts in any of the following scripts review the new corpus cleanup code and make comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there. I plan to commit new copies of the training data (unicharsets, wordlists, training text, etc.), at which point the data and the cleanup rules will match.
E.g., the code determines what makes a valid or invalid sequence of Unicode code points in the script. For instance, is it allowed to have two matras in a row? It gets harder when the question is which category the additional characters belong to.
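To make the kind of rule under review concrete, here is a minimal sketch of a "two matras in a row" check for Devanagari. It is not the code in training/validat*: the names `CharClass`, `Classify`, and `HasConsecutiveMatras` are hypothetical, the character ranges are a deliberately small subset of the Devanagari block, and treating consecutive matras as invalid is an assumption made only for illustration (it is exactly the sort of question script experts are asked to settle).

```cpp
#include <cstddef>
#include <string>

// Simplified character classes for Devanagari. Illustrative only; these are
// NOT the classes used by Tesseract's validator.
enum class CharClass { kConsonant, kMatra, kVirama, kOther };

// Classifies one code point using a small subset of the Devanagari block
// (U+0900..U+097F). Nukta, anusvara, independent vowels, and the less common
// vowel signs all fall into kOther here, which a real validator could not do.
CharClass Classify(char32_t ch) {
  if (ch >= 0x0915 && ch <= 0x0939) return CharClass::kConsonant;  // KA..HA
  if (ch >= 0x093E && ch <= 0x094C) return CharClass::kMatra;  // signs AA..AU
  if (ch == 0x094D) return CharClass::kVirama;
  return CharClass::kOther;
}

// Returns true if any two adjacent code points are both matras.
// Assumption: such a sequence is considered invalid for this sketch.
bool HasConsecutiveMatras(const std::u32string& text) {
  for (std::size_t i = 1; i < text.size(); ++i) {
    if (Classify(text[i - 1]) == CharClass::kMatra &&
        Classify(text[i]) == CharClass::kMatra) {
      return true;
    }
  }
  return false;
}
```

For example, `HasConsecutiveMatras(U"\u0915\u093F")` (KA followed by vowel sign I) is false, while appending a second vowel sign such as U+0940 makes it true. The harder part mentioned above shows up in `Classify`: every character left in `kOther` needs a category decision from someone who knows the script.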
Major new normalization/text cleanup code in training/validat*. The best help with this would be expertise in the various scripts, as previously discussed.
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.h