[I hit send too soon on this; I'm updating the comment.]
I think the time might have come to create an 's5b' version of the WSJ setup. WSJ is the oldest setup and the local scripts are not up to the standard of clarity that we usually expect. Some specific issues:
- The dictionary preparation scripts (for the larger dictionary) rely on some special-purpose scripts that I created a long time ago; Phonetisaurus would probably be a better choice.
- Preparation and cleaning of the language modeling data is unnecessarily mixed up with the dictionary preparation. It's better to keep separate things separate.
- The scripts in local/ need much clearer and cleaner interfaces: it needs to be clear what the inputs and outputs are.
- I'm not convinced that I like the way the text is normalized. Lower-case is more common than upper-case in ASR setups these days, so let's do it lower-case, use `<unk>` (lower-case), which is more standard, instead of `<SPOKEN_NOISE>`, and make the phone names lower-case instead of upper-case.
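
To make the normalization change in the last point concrete, here is a rough sketch of what it could look like. It is not taken from any existing script: the function names are hypothetical, and it assumes the usual `utt-id word word ...` layout for transcript lines and `word phone phone ...` for lexicon lines.

```python
# Hypothetical sketch of the proposed normalization; none of these names
# exist in the current WSJ local/ scripts.
OLD_UNK = "<SPOKEN_NOISE>"
NEW_UNK = "<unk>"


def normalize_word(word):
    """Map the old unknown-word symbol to <unk>; lower-case everything else."""
    return NEW_UNK if word == OLD_UNK else word.lower()


def normalize_text_line(line):
    """Normalize one transcript line, leaving the utterance id untouched."""
    parts = line.split()
    if not parts:
        return ""
    utt_id, words = parts[0], parts[1:]
    return " ".join([utt_id] + [normalize_word(w) for w in words])


def normalize_lexicon_line(line):
    """Normalize one lexicon line: lower-case the word and its phone names."""
    parts = line.split()
    if not parts:
        return ""
    word, phones = parts[0], parts[1:]
    return " ".join([normalize_word(word)] + [p.lower() for p in phones])
```

For example, `normalize_text_line("011c0201 THE SALE <SPOKEN_NOISE>")` gives `011c0201 the sale <unk>`. Other bracketed tokens are simply lower-cased here; whether that is the right treatment for them is a separate decision.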
Part of my motivation is that we'll be doing some RNNLM stuff with this setup (since we have example scripts for older setups) and the scripts need to be cleaner. I don't know if anyone has the time and inclination to work on this?