Tokenize training transcripts by grapheme (cluster) instead of codepoint · mozilla/DeepSpeech#811

(2 comments) (1 reaction) (1 assignee)C++ (4,093 forks)batch import

enhancementhelp wanted

Repository metrics

This issue does not include a description.

Research direction: Familiarize yourself with the current tokenization code in the DeepSpeech training pipeline (likely in Python). Study how transcripts are tokenized by codepoint. Then research how to tokenize by grapheme clusters using Unicode segmentation (e.g., the grapheme library in Python or ICU in C++). Identify the exact location in the codebase where tokenization occurs and propose changes to split by grapheme clusters instead of codepoints. Consider edge cases like combining characters and emoji sequences.
Tech stack: python
Domain: machine learningaidata
Issue type: Feature
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: PythonUnicode text processingDeepSpeech training pipeline
Newbie friendliness: 55