good first issue
Description
Thank you very much for this great contribution.
I found that the masked LM loss stops decreasing once it reaches a value of around 7. In the official TensorFlow implementation, however, the MLM loss decreases to about 1 easily. I think something is wrong in your implementation.
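In addition, I found that the code cannot predict the next sentence correctly. I think the reason is: self.criterion = nn.NLLLoss(ignore_index=0). This criterion cannot be used for sentence prediction, because the sentence label is either 1 or 0. With ignore_index=0, every sample whose label is 0 is excluded from the loss, so the model is only ever trained on label 1. We should remove ignore_index=0 for sentence prediction and keep it only for the masked LM loss, where index 0 is padding.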
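To illustrate the point, here is a minimal pure-Python sketch of a mean NLL loss that skips targets equal to ignore_index, which is how I understand nn.NLLLoss to behave. The helper nll_loss and the toy data are my own assumptions for illustration, not code from this repository.

```python
import math

def nll_loss(log_probs, targets, ignore_index=None):
    """Mean negative log-likelihood over targets that are not ignored.

    Hypothetical helper that mimics the ignore_index behaviour described
    above; it is not the PyTorch implementation.
    """
    kept = [(lp, t) for lp, t in zip(log_probs, targets) if t != ignore_index]
    if not kept:
        return float("nan")  # nothing left to average over
    return -sum(lp[t] for lp, t in kept) / len(kept)

# A degenerate "model" that always predicts label 1 with probability 0.99.
always_one = [math.log(0.01), math.log(0.99)]
targets = [0, 1, 0, 1]  # toy next-sentence labels
log_probs = [always_one] * len(targets)

# With ignore_index=0 the label-0 samples never contribute to the loss,
# so the degenerate model looks almost perfect.
loss_ignoring_zero = nll_loss(log_probs, targets, ignore_index=0)

# Without ignore_index the wrong predictions on label 0 are penalised.
loss_all_labels = nll_loss(log_probs, targets)

print(loss_ignoring_zero, loss_all_labels)
```

In this sketch a model that always answers 1 gets a loss close to zero when label 0 is ignored, but a much larger loss when both labels count. That would explain why next sentence prediction does not learn anything useful with the current criterion.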
I am looking forward to your reply~