neptune.ai logger produces lots of errors when logging "training/epoch"
#19679, opened on March 20, 2024
### Bug description
The Neptune logger emits many errors like `[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0`.
These appear to be false positives: the "training/epoch" curve in the Neptune UI looks fine.
Similar to https://github.com/Lightning-AI/pytorch-lightning/issues/2946
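To make the error message concrete, here is a toy model of a "strictly increasing step" rule. It is only a sketch: `append_point` and the write pattern are hypothetical and are not Neptune's implementation. It shows how a second write at an already-used step could be rejected while the first write is kept, which would be consistent with the curve still looking fine.

```python
def append_point(series, step, value):
    """Append (step, value) only if step is strictly greater than the last stored step."""
    if series and step <= series[-1][0]:
        return False  # rejected, analogous to the "Invalid point" error
    series.append((step, value))
    return True


epoch_series = []
# Hypothetical write pattern: the epoch value is sent twice at step 6.
writes = [(4, 0), (6, 0), (6, 0), (9, 1)]
results = [append_point(epoch_series, step, value) for step, value in writes]

print(results)       # [True, True, False, True]
print(epoch_series)  # [(4, 0), (6, 0), (9, 1)]
```

Only the duplicate at step 6 is dropped; every distinct step keeps its value.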
### What version are you seeing the problem on?
v2.2
### How to reproduce the bug
Set the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables first so the logger can connect to neptune.ai.
```python
import os

import lightning as lit
import torch
from lightning.pytorch.loggers import NeptuneLogger
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """100 random samples with random labels."""

    def __len__(self):
        return 100

    def __getitem__(self, item):
        return {"image": torch.rand(3, 16, 16), "label": torch.randint(0, 100, (1,))}


class DummyModel(lit.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(3 * 16 * 16, 100)
        self.epoch_identifier = "dummy"

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)  # flatten images to (batch, 768)
        y = y.view(-1)  # labels from (batch, 1) to (batch,)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # Logged both per step and aggregated per epoch.
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def validation_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)
        y = y.view(-1)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)

    def test_step(self, batch, batch_idx):
        return self.validation_step(batch, batch_idx)


def main():
    # Reads NEPTUNE_API_TOKEN and NEPTUNE_PROJECT from the environment.
    wlogger = NeptuneLogger(log_model_checkpoints=False)
    output_dir = "temp_lit"
    os.makedirs(output_dir, exist_ok=True)
    trainer = lit.Trainer(
        devices=1,
        default_root_dir=output_dir,
        logger=wlogger,
        max_epochs=5,
        enable_progress_bar=False,
        log_every_n_steps=5,
    )
    model = DummyModel()
    dataset = DummyDataset()
    train_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    val_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    trainer.fit(model, train_loader, val_loader)


if __name__ == "__main__":
    main()
```
### Error messages and logs
```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[neptune] [info ] Neptune initialized. Open in the app: https://app.neptune.ai/ [...]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
  | Name  | Type   | Params
0 | model | Linear | 76.9 K
76.9 K    Trainable params
0         Non-trainable params
76.9 K    Total params
0.308     Total estimated model params size (MB)
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 6.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 13.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 20.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 27.0
Trainer.fit stopped: max_epochs=5 reached.
[neptune] [info ] Shutting down background jobs, please wait a moment...
[neptune] [info ] Done!
[neptune] [info ] Waiting for the remaining 17 operations to synchronize with Neptune. Do not kill this process.
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [info ] All 17 operations synced, thanks for waiting!
[neptune] [info ] Explore the metadata in the Neptune app: https://app.neptune.ai/ [...]
```
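The invalid points in the log (6, 13, 20, 27, 34) line up with the zero-based index of the last training step of each epoch. The quick check below uses only the numbers from the repro script. That the rejected point is a second write of `training/epoch` at that same step (for example from the epoch-level aggregation) is my assumption; the log does not confirm it.

```python
import math

dataset_len = 100  # DummyDataset.__len__
batch_size = 16    # DataLoader(batch_size=16), drop_last left at its default (False)
max_epochs = 5     # Trainer(max_epochs=5)

# The last partial batch is kept, so each epoch has ceil(100 / 16) = 7 steps.
steps_per_epoch = math.ceil(dataset_len / batch_size)

# Zero-based index of the final training step in each epoch.
last_step_per_epoch = [steps_per_epoch * (epoch + 1) - 1 for epoch in range(max_epochs)]

print(steps_per_epoch)      # 7
print(last_step_per_epoch)  # [6, 13, 20, 27, 34]
```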
### Environment
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
  - NVIDIA GeForce GTX 1070 Ti
  - Quadro P400
- available: True
- version: 12.1
* Lightning:
- lightning: 2.2.1
- lightning-utilities: 0.11.0
- pytorch-lightning: 2.2.1
- torch: 2.2.1
- torchaudio: 2.2.1
- torchmetrics: 1.3.2
- torchvision: 0.17.1
* Packages:
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- arrow: 1.3.0
- async-timeout: 4.0.3
- attrs: 23.2.0
- boto3: 1.34.66
- botocore: 1.34.66
- bravado: 11.0.3
- bravado-core: 6.1.1
- brotli: 1.0.9
- certifi: 2024.2.2
- charset-normalizer: 2.0.4
- click: 8.1.7
- filelock: 3.13.1
- fqdn: 1.5.1
- frozenlist: 1.4.1
- fsspec: 2024.3.1
- future: 1.0.0
- gitdb: 4.0.11
- gitpython: 3.1.42
- gmpy2: 2.1.2
- idna: 3.4
- isoduration: 20.11.0
- jinja2: 3.1.3
- jmespath: 1.0.1
- jsonpointer: 2.4
- jsonref: 1.1.0
- jsonschema: 4.21.1
- jsonschema-specifications: 2023.12.1
- lightning: 2.2.1
- lightning-utilities: 0.11.0
- markupsafe: 2.1.3
- mkl-fft: 1.3.8
- mkl-random: 1.2.4
- mkl-service: 2.4.0
- monotonic: 1.6
- mpmath: 1.3.0
- msgpack: 1.0.8
- multidict: 6.0.5
- neptune: 1.9.1
- networkx: 3.1
- numpy: 1.26.4
- oauthlib: 3.2.2
- packaging: 24.0
- pandas: 2.2.1
- pillow: 10.2.0
- pip: 23.3.1
- psutil: 5.9.8
- pyjwt: 2.8.0
- pysocks: 1.7.1
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.2.1
- pytz: 2024.1
- pyyaml: 6.0.1
- referencing: 0.34.0
- requests: 2.31.0
- requests-oauthlib: 1.4.0
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rpds-py: 0.18.0
- s3transfer: 0.10.1
- setuptools: 68.2.2
- simplejson: 3.19.2
- six: 1.16.0
- smmap: 5.0.1
- swagger-spec-validator: 3.0.3
- sympy: 1.12
- torch: 2.2.1
- torchaudio: 2.2.1
- torchmetrics: 1.3.2
- torchvision: 0.17.1
- tqdm: 4.66.2
- triton: 2.2.0
- types-python-dateutil: 2.9.0.20240316
- typing-extensions: 4.9.0
- tzdata: 2024.1
- uri-template: 1.3.0
- urllib3: 2.1.0
- webcolors: 1.13
- websocket-client: 1.7.0
- wheel: 0.41.2
- yarl: 1.9.4
* System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.10.13
- release: 5.4.0-172-generic
- version: #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
</details>
### More info
_No response_