2 comments (2 comments)1 reaction (1 reaction)1 assignee (1 assignee)C++26,755 stars (26,755 stars)4,093 forks (4,093 forks)batch import
enhancementhelp wanted
- Issue 種別
- feature
- 調査方針
- The current training pipeline tokenizes transcripts using Unicode codepoints. The contributor should find the relevant tokenization code in the data preprocessing module (e.g., DeepSpeech.py or util.py) and modify it to segment text by grapheme clusters. Use a library like 'grapheme' or Python's 'unicodedata' to handle cluster boundaries. Ensure the change is backward compatible and test with multilingual datasets to verify correctness. Check issue comments for any additional discussion on implementation details.