mozilla/DeepSpeech
View on GitHubTokenize training transcripts by grapheme (cluster) instead of codepoint
Open
#811 opened on Sep 5, 2017
2 comments (2 comments)1 reaction (1 reaction)1 assignee (1 assignee)C++26,755 stars (26,755 stars)4,093 forks (4,093 forks)batch import
enhancementhelp wanted
Description
This issue does not include a description.
Contributor guide
- Tech stack
- pythontensorflow
- Domain
- machine learning
- Issue type
- feature
- DifficultyEstimated implementation difficulty for a new contributor, from 1 for very small changes to 5 for expert-level work.
- 3
- Estimated timeA rough time range for an experienced contributor to investigate, implement, test, and prepare a pull request.
- 1-2 days
- Activity statusHow available the issue appears right now: fresh, active, stale, blocked, or waiting on maintainer input.
- stale
- ClarityHow clearly the issue explains the expected change, acceptance criteria, and next step.
- mostly clear
- Prerequisites
- Unicode grapheme clustersPythonTensorFlow
- Newbie friendlinessA 1-100 score estimating how approachable this issue is for first-time contributors.
- 25
- Research direction
- The current training pipeline tokenizes transcripts using Unicode codepoints. The contributor should find the relevant tokenization code in the data preprocessing module (e.g., DeepSpeech.py or util.py) and modify it to segment text by grapheme clusters. Use a library like 'grapheme' or Python's 'unicodedata' to handle cluster boundaries. Ensure the change is backward compatible and test with multilingual datasets to verify correctness. Check issue comments for any additional discussion on implementation details.