good first issue
Description
While reading main.py in the word-level language modeling example, I noticed a possible inconsistency. The final evaluation loss is meant to be the mean of the individual losses, computed as a weighted mean over batches in which each batch's weight is its sequence length.
There are len(data_source)-1 such losses.
At the end, however, the sum is divided by len(data_source) rather than len(data_source)-1, so the denominator is off by one.
A similar issue arises in the bookkeeping for the training loss. If this is correct, the fix should be straightforward: keep track of total_seen and divide by that, instead of by a predetermined quantity, in both the training and evaluation cases.
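To make the proposal concrete, here is a minimal sketch of the total_seen bookkeeping in plain Python. It is not the code from main.py: the helper names batch_lengths and weighted_mean_loss, the example sizes, and the constant per-batch loss of 2.0 are all hypothetical, and the slicing assumes the last row is used only as a target, which is what gives len(data_source)-1 scored positions.

```python
def batch_lengths(n_rows, bptt):
    """Sequence lengths of the batches taken from n_rows rows.

    Assumption: the last row only serves as a target, so n_rows - 1
    positions are scored in total.
    """
    return [min(bptt, n_rows - 1 - i) for i in range(0, n_rows - 1, bptt)]


def weighted_mean_loss(batches):
    """Mean loss over all positions, given (seq_len, mean_loss) pairs."""
    total_loss = 0.0
    total_seen = 0
    for seq_len, mean_loss in batches:
        total_loss += seq_len * mean_loss  # undo the per-batch averaging
        total_seen += seq_len              # count positions actually scored
    return total_loss / total_seen


n_rows, bptt = 11, 4                      # hypothetical sizes
lengths = batch_lengths(n_rows, bptt)     # three batches: 4 + 4 + 2 positions
batches = [(n, 2.0) for n in lengths]     # hypothetical constant loss per batch

fixed = weighted_mean_loss(batches)                 # divides by total_seen
old = sum(n * l for n, l in batches) / n_rows       # divides by n_rows instead

print(lengths, fixed, old)
```

With a constant per-batch loss, dividing by total_seen recovers that constant exactly, while dividing by the number of rows gives a slightly smaller value. The same accumulator could be reset at each logging interval to cover the training case.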
Tagging: @Smerity