Labels: enhancement, help wanted
Description
Similar to the ParallelEncoder, a ParallelDecoder setup could enable multi-task learning. It should not be too hard to implement, but a few details need care:
- support separate values for the decoding parameters (beam_width, length_penalty, etc.) of each decoder,
- parts of SequenceToSequence assume a single output head (e.g. loss computation, reverse vocabulary lookup, and exported outputs for model serving); this logic should be moved into the decoder itself.
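
To make the two points above concrete, here is a minimal sketch of how the per-head logic could be organized. It is plain Python rather than TensorFlow, and every name in it (`DecodingParams`, `DecoderHead`, `loss_weight`, `export_outputs`, the `"<name>/tokens"` output keys) is hypothetical and not part of the existing API. Each head owns its decoding parameters, its loss, and its reverse vocabulary lookup, so the model would only have to combine the results.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class DecodingParams:
    """Decoding parameters kept per head (hypothetical container)."""
    beam_width: int = 1
    length_penalty: float = 0.0


@dataclass
class DecoderHead:
    """One output head that owns its vocabulary, loss, and decoding parameters."""
    name: str
    vocab: List[str]  # index -> token, used for the reverse vocabulary lookup
    params: DecodingParams = field(default_factory=DecodingParams)
    loss_weight: float = 1.0  # assumption: task losses are mixed by a weighted sum

    def loss(self, log_probs: Sequence[Sequence[float]],
             target_ids: Sequence[int]) -> float:
        # Mean negative log-likelihood of the target tokens.
        nll = -sum(step[t] for step, t in zip(log_probs, target_ids))
        return nll / max(len(target_ids), 1)

    def ids_to_tokens(self, ids: Sequence[int]) -> List[str]:
        return [self.vocab[i] for i in ids]


class ParallelDecoder:
    """Holds several heads and combines their per-head results."""

    def __init__(self, heads: Sequence[DecoderHead]):
        self.heads: Dict[str, DecoderHead] = {h.name: h for h in heads}
        if len(self.heads) != len(heads):
            raise ValueError("head names must be unique")

    def loss(self, log_probs: Dict[str, Sequence[Sequence[float]]],
             targets: Dict[str, Sequence[int]]) -> float:
        # Weighted sum of the per-head losses.
        return sum(h.loss_weight * h.loss(log_probs[n], targets[n])
                   for n, h in self.heads.items())

    def export_outputs(self, predicted_ids: Dict[str, Sequence[int]]
                       ) -> Dict[str, List[str]]:
        # One named output per head, e.g. for a serving signature.
        return {f"{n}/tokens": h.ids_to_tokens(predicted_ids[n])
                for n, h in self.heads.items()}
```

With this layout, a head for translation could use `DecodingParams(beam_width=4, length_penalty=0.6)` while a tagging head keeps the defaults, and SequenceToSequence would no longer need to know how many heads exist.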