2 留言 (2 留言)1 反應 (1 反應)1 負責人 (1 負責人)C++26,755 star (26,755 star)4,093 fork (4,093 fork)batch import
enhancementhelp wanted
- 議題類型
- feature
- 研究方向
- The current training pipeline tokenizes transcripts using Unicode codepoints. The contributor should find the relevant tokenization code in the data preprocessing module (e.g., DeepSpeech.py or util.py) and modify it to segment text by grapheme clusters. Use a library like 'grapheme' or Python's 'unicodedata' to handle cluster boundaries. Ensure the change is backward compatible and test with multilingual datasets to verify correctness. Check issue comments for any additional discussion on implementation details.