KeyError on (supposedly) unknown tokens during numericalize · pytorch/text#337

help wanted

Description

Hi guys,

We've been refactoring the OpenNMT/OpenNMT-py project in our fork, Ubiqus/OpenNMT-py. The changes include distributed processing, compatibility with pytorch==0.4 and, what probably matters most in this case, a switch from epoch-based to step-based training.

Before that, we trained for a full epoch, then ran validation, then saved the checkpoint.

Now we just count iterations and validate or save depending on a parameter (every -valid_steps and -save_checkpoint_steps steps, respectively).
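To make the schedule concrete, here is a minimal sketch of a step-based loop. The function name run_steps and the returned event list are hypothetical and only illustrate the ordering; they are not taken from the OpenNMT-py trainer.

```python
def run_steps(train_steps, valid_steps, save_checkpoint_steps):
    """Return the ordered actions a step-based loop would take (illustration only)."""
    events = []
    for step in range(1, train_steps + 1):
        events.append(("train", step))
        # Validation and saving are triggered by the step counter, not by epoch ends.
        if step % valid_steps == 0:
            events.append(("validate", step))
        if step % save_checkpoint_steps == 0:
            events.append(("save", step))
    return events


schedule = run_steps(train_steps=4, valid_steps=2, save_checkpoint_steps=4)
```

With this kind of schedule, validation and checkpointing can happen in the middle of a pass over the training data, so any side effect they have on shared state is seen by the following training batches.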

While testing on toy sets (using: https://github.com/Ubiqus/OpenNMT-py/blob/master/onmt/tests/test_models.sh) we hit a KeyError in the vocabularies (the .stoi lookup, as the traces show). The key (word) itself isn't important: it is not the same every time, and it may or may not be lower case. More importantly, we suspect that it only happens for unknown words. In both cases below, it happens when we start an iterator over the dataset.

Setup

Using Ubiqus/OpenNMT-py (branch: vocab_error)
Using torchtext=0.2.1

Error during validation

Toy transformer model (it happens in other models as well) with -valid_steps 1000

Trace

Step 1000,  2000; acc:  46.32; ppl:  10.78; xent:   2.38; lr: 0.00395; 50260 / 50342 tok/s;     66 sec
Loading valid dataset from ./data/data.valid.1.pt, number of examples: 2819
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    main(opt)
  File "train.py", line 28, in main
    single_main(opt)
  File "/home/moses/pytorchwork/AAN/U-NMT-bug/train_single.py", line 120, in main
    opt.valid_steps)
  File "/home/moses/pytorchwork/AAN/U-NMT-bug/onmt/trainer.py", line 176, in train
    valid_stats = self.validate(valid_iter)
  File "/home/moses/pytorchwork/AAN/U-NMT-bug/onmt/trainer.py", line 208, in validate
    for batch in valid_iter:
  File "/home/moses/pytorchwork/AAN/U-NMT-bug/onmt/inputters/inputter.py", line 423, in __iter__
    for batch in self.cur_iter:
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py", line 287, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py", line 287, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py", line 287, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'Anand'

How to replicate

(using test_models.sh)

./onmt/tests/test_models.sh set_gpu set_debug transformer

Error after saving

Trace

Start training...
Loading train dataset from ./data/data.train.1.pt, number of examples: 10000
Step 50,  2000; acc:  42.07; ppl:  22.44; xent:   3.11; lr: 0.00625; 54535 / 58156 tok/s;      3 sec
Saving checkpoint /tmp/onmt_tmp_model_step_50.pt
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    main(opt)
  File "train.py", line 28, in main
    single_main(opt)
  File "/home/pltrdy/pytorchwork/unmt/train_single.py", line 120, in main
    opt.valid_steps)
  File "/home/pltrdy/pytorchwork/unmt/onmt/trainer.py", line 141, in train
    for i, batch in enumerate(train_iter):
  File "/home/pltrdy/pytorchwork/unmt/onmt/inputters/inputter.py", line 423, in __iter__
    for batch in self.cur_iter:
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 180, in __iter__
    self.train)
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/batch.py", line 22, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py", line 187, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py", line 286, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py", line 286, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/pltrdy/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py", line 286, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'seafront'

How to replicate

In onmt/tests/test_models.sh, change line 231 to -save_checkpoint_steps 50 \ and then run the same command:

./onmt/tests/test_models.sh set_gpu set_debug transformer

More about the saving.

We save the checkpoint together with the vocabulary. To do so, we call save_fields_to_vocab(self.fields) (model_saver.py#L111), which is defined as follows (inputter.py#L70):

def save_fields_to_vocab(fields):
    vocab = []
    for k, f in fields.items():
        if f is not None and 'vocab' in f.__dict__:
            f.vocab.stoi = dict(f.vocab.stoi)
            vocab.append((k, f.vocab))
    return vocab

NOTE: the name of this function isn't relevant anymore.
I'm not sure how this could lead to problems, but it may be related.
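One way the function above could be related, sketched with the standard library only: assuming the vocabulary's stoi is a collections.defaultdict that returns the unk index for missing keys (this is an assumption about torchtext 0.2.x, not something verified here), replacing it in place with a plain dict would remove that fallback for every later lookup. ToyVocab, save_in_place and save_copy are hypothetical stand-ins, not torchtext or OpenNMT-py APIs.

```python
from collections import defaultdict


def unk_index():
    # Index handed out for any token that is not in the vocabulary.
    return 0


class ToyVocab:
    """Hypothetical stand-in for a vocabulary object (not the real torchtext class)."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + list(tokens)
        self.stoi = defaultdict(unk_index)
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})


def save_in_place(vocab):
    # Same assignment as in save_fields_to_vocab: the live vocab is mutated.
    vocab.stoi = dict(vocab.stoi)
    return vocab


def save_copy(vocab):
    # Alternative: build a plain dict for serialization and leave the vocab untouched.
    return dict(vocab.stoi)


vocab = ToyVocab(["the", "sea"])
first_lookup = vocab.stoi["seafront"]   # unknown token falls back to the unk index

snapshot = save_copy(vocab)
second_lookup = vocab.stoi["Anand"]     # still a defaultdict, still falls back

save_in_place(vocab)
try:
    vocab.stoi["harbour"]               # first unseen token after the in-place save
    raised = False
except KeyError:
    raised = True
```

Under that assumption, only tokens that were never looked up before the save would fail, which would be consistent with the key being different on each run. Saving a copy instead of reassigning f.vocab.stoi is one possible way to avoid the side effect.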

Discussions

1. These exceptions may or may not be raised depending on the -valid_steps and -save_checkpoint_steps values, so the problem probably comes from a side effect. For example, in the first case the model is saved without error.
2. Because of (1), this is not really critical, but we think it may hide larger problems.
3. During our investigation we found that the vocabularies keep growing during training: len(field.vocab.stoi) goes from 1000 to 50k+. New tokens are added and mapped to 0 (unk). We're not sure whether this is intended (and why).
4. I don't think the problem comes from torchtext. I'm just wondering whether all of this makes sense to you and whether you could suggest a fix or a better way to use torchtext.
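The vocabulary growth described above is what a collections.defaultdict does on plain indexing, so it may simply be a consequence of how stoi is stored. A minimal sketch with the standard library, assuming stoi is a defaultdict whose default is the unk index 0 (the token lists are made up for illustration):

```python
from collections import defaultdict

stoi = defaultdict(lambda: 0)            # 0 stands for the unk index
stoi.update({"<unk>": 0, "the": 1, "sea": 2})
size_before = len(stoi)

# Plain indexing inserts every missing key with the default value.
for token in ["Anand", "seafront", "the"]:
    _ = stoi[token]
size_after_index = len(stoi)

# .get() reads without inserting, so the mapping does not grow.
_ = stoi.get("harbour", 0)
size_after_get = len(stoi)
```

Each distinct unknown token seen during training would add one entry mapped to 0, which would match a mapping that grows from its initial size as more data is numericalized.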

Thanks a lot. Paul

Contributor guide

Research direction: Investigate the root cause of KeyError during numericalize when using step based training and saving vocab. Check if vocab expansion during training is intended and how it leads to missing keys.
Tech stack: Python, PyTorch
Domain: backend, data, machine learning
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Stale
Clarity: Clear
Prerequisites: Python, PyTorch
Newbie friendliness: 60
