neptune.ai logger produces lots of errors when logging "training/epoch" · Lightning-AI/pytorch-lightning#19679


Labels: bug, help wanted, logger: neptune


Description

### Bug description

The Neptune logger emits many errors of the form `[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0`.

These appear to be false positives: the `training/epoch` curve in the Neptune UI looks fine.
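
If the messages are indeed harmless, one way to keep them out of the console is a standard `logging.Filter`. The sketch below assumes the messages pass through Python's `logging` module under a logger named `neptune`; that name is an assumption and should be checked against the installed Neptune version. It hides the symptom only and does nothing about the duplicate writes themselves.

```python
import logging


class DropStepOrderErrors(logging.Filter):
    """Drop records that contain the strictly-increasing-step complaint."""

    NEEDLE = "X-coordinates (step) must be strictly increasing"

    def filter(self, record):
        # Returning False discards the record; everything else passes through.
        return self.NEEDLE not in record.getMessage()


# Assumption: "neptune" is the logger name used by the Neptune client.
neptune_logger = logging.getLogger("neptune")
neptune_logger.addFilter(DropStepOrderErrors())
```

A filter attached to a logger only sees records logged directly on that logger, so if the client logs through a child logger the filter would need to be attached there (or to a handler) instead.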

### What version are you seeing the problem on?

v2.2

### How to reproduce the bug

Set the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables first so that the logger can connect to neptune.ai.


```python
import os

import lightning as lit
import torch
from lightning.pytorch.loggers import NeptuneLogger
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, item):
        return {"image": torch.rand(3, 16, 16), "label": torch.randint(0, 100, (1,))}


class DummyModel(lit.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(3 * 16 * 16, 100)
        self.epoch_identifier = "dummy"

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)
        y = y.view(-1)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def validation_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)
        y = y.view(-1)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)

    def test_step(self, batch, batch_idx):
        return self.validation_step(batch, batch_idx)


def main():
    wlogger = NeptuneLogger(log_model_checkpoints=False)
    output_dir = "temp_lit"
    os.makedirs(output_dir, exist_ok=True)

    trainer = lit.Trainer(
        devices=1,
        default_root_dir=output_dir,
        logger=wlogger,
        max_epochs=5,
        enable_progress_bar=False,
        log_every_n_steps=5,
    )
    model = DummyModel()
    dataset = DummyDataset()
    train_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    val_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    trainer.fit(model, train_loader, val_loader)


if __name__ == "__main__":
    main()
```



### Error messages and logs

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[neptune] [info ] Neptune initialized. Open in the app: https://app.neptune.ai/ [...]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type   | Params
0 | model | Linear | 76.9 K

76.9 K    Trainable params
0         Non-trainable params
76.9 K    Total params
0.308     Total estimated model params size (MB)
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 6.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 13.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 20.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 27.0
Trainer.fit stopped: max_epochs=5 reached.
[neptune] [info ] Shutting down background jobs, please wait a moment...
[neptune] [info ] Done!
[neptune] [info ] Waiting for the remaining 17 operations to synchronize with Neptune. Do not kill this process.
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [info ] All 17 operations synced, thanks for waiting!
[neptune] [info ] Explore the metadata in the Neptune app: https://app.neptune.ai/ [...]
```
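
The invalid points in this log (6, 13, 20, 27, 34) coincide with the last training step of each epoch in the reproduction script. The short check below works through that arithmetic using the values from the script. It assumes that steps are numbered from zero and that the final partial batch counts as a step; it shows only where the rejected points fall, not why a second write happens there.

```python
import math

# Values taken from the reproduction script.
dataset_size = 100
batch_size = 16
max_epochs = 5

# DataLoader keeps the final partial batch by default (drop_last=False),
# so the number of batches per epoch is rounded up.
steps_per_epoch = math.ceil(dataset_size / batch_size)

# Assumption: steps are zero-based, so epoch k ends on step (k + 1) * steps_per_epoch - 1.
last_step_per_epoch = [(epoch + 1) * steps_per_epoch - 1 for epoch in range(max_epochs)]

print(steps_per_epoch)      # 7
print(last_step_per_epoch)  # [6, 13, 20, 27, 34]
```

Because these values match the reported invalid points exactly, one plausible reading is that `training/epoch` is written more than once at each epoch boundary, for example once alongside the validation metrics and once alongside the epoch-level training metrics. This is a hypothesis, not a confirmed diagnosis.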



### Environment

<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:
		- NVIDIA GeForce GTX 1070 Ti
		- Quadro P400
	- available:         True
	- version:           12.1
* Lightning:
	- lightning:         2.2.1
	- lightning-utilities: 0.11.0
	- pytorch-lightning: 2.2.1
	- torch:             2.2.1
	- torchaudio:        2.2.1
	- torchmetrics:      1.3.2
	- torchvision:       0.17.1
* Packages:
	- aiohttp:           3.9.3
	- aiosignal:         1.3.1
	- arrow:             1.3.0
	- async-timeout:     4.0.3
	- attrs:             23.2.0
	- boto3:             1.34.66
	- botocore:          1.34.66
	- bravado:           11.0.3
	- bravado-core:      6.1.1
	- brotli:            1.0.9
	- certifi:           2024.2.2
	- charset-normalizer: 2.0.4
	- click:             8.1.7
	- filelock:          3.13.1
	- fqdn:              1.5.1
	- frozenlist:        1.4.1
	- fsspec:            2024.3.1
	- future:            1.0.0
	- gitdb:             4.0.11
	- gitpython:         3.1.42
	- gmpy2:             2.1.2
	- idna:              3.4
	- isoduration:       20.11.0
	- jinja2:            3.1.3
	- jmespath:          1.0.1
	- jsonpointer:       2.4
	- jsonref:           1.1.0
	- jsonschema:        4.21.1
	- jsonschema-specifications: 2023.12.1
	- lightning:         2.2.1
	- lightning-utilities: 0.11.0
	- markupsafe:        2.1.3
	- mkl-fft:           1.3.8
	- mkl-random:        1.2.4
	- mkl-service:       2.4.0
	- monotonic:         1.6
	- mpmath:            1.3.0
	- msgpack:           1.0.8
	- multidict:         6.0.5
	- neptune:           1.9.1
	- networkx:          3.1
	- numpy:             1.26.4
	- oauthlib:          3.2.2
	- packaging:         24.0
	- pandas:            2.2.1
	- pillow:            10.2.0
	- pip:               23.3.1
	- psutil:            5.9.8
	- pyjwt:             2.8.0
	- pysocks:           1.7.1
	- python-dateutil:   2.9.0.post0
	- pytorch-lightning: 2.2.1
	- pytz:              2024.1
	- pyyaml:            6.0.1
	- referencing:       0.34.0
	- requests:          2.31.0
	- requests-oauthlib: 1.4.0
	- rfc3339-validator: 0.1.4
	- rfc3986-validator: 0.1.1
	- rpds-py:           0.18.0
	- s3transfer:        0.10.1
	- setuptools:        68.2.2
	- simplejson:        3.19.2
	- six:               1.16.0
	- smmap:             5.0.1
	- swagger-spec-validator: 3.0.3
	- sympy:             1.12
	- torch:             2.2.1
	- torchaudio:        2.2.1
	- torchmetrics:      1.3.2
	- torchvision:       0.17.1
	- tqdm:              4.66.2
	- triton:            2.2.0
	- types-python-dateutil: 2.9.0.20240316
	- typing-extensions: 4.9.0
	- tzdata:            2024.1
	- uri-template:      1.3.0
	- urllib3:           2.1.0
	- webcolors:         1.13
	- websocket-client:  1.7.0
	- wheel:             0.41.2
	- yarl:              1.9.4
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.10.13
	- release:           5.4.0-172-generic
	- version:           #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024

</details>

### More info

_No response_

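Contributor guide

Research direction: Investigate how the Neptune logger handles the 'training/epoch' metric. The error indicates that steps are not strictly increasing, possibly because of duplicate logging or incorrect step assignment. Check the NeptuneLogger integration code in PyTorch Lightning to see how steps are computed and passed to Neptune.
Tech stack: Python, PyTorch
Area: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: active
Clarity: mostly clear
Prerequisites: Python, PyTorch
Beginner friendliness: 65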
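
To make the research direction concrete, here is a minimal, self-contained sketch of the kind of check the error message describes. `StrictSeries` is a hypothetical stand-in, not Neptune's implementation: it accepts a point only if its step is greater than the last accepted step. Writing the same step twice, as would happen if `training/epoch` were logged twice at an epoch boundary, reproduces the wording of the error while leaving the stored points intact.

```python
class StrictSeries:
    """Hypothetical series that requires strictly increasing steps."""

    def __init__(self, name):
        self.name = name
        self.points = []  # accepted (step, value) pairs
        self.errors = []  # messages for rejected points

    def append(self, value, step):
        # Reject any step that does not exceed the last accepted step.
        if self.points and step <= self.points[-1][0]:
            self.errors.append(
                "X-coordinates (step) must be strictly increasing "
                f"for series attribute: {self.name}. Invalid point: {float(step)}"
            )
            return False
        self.points.append((step, value))
        return True


series = StrictSeries("training/epoch")
series.append(0, step=4)  # regular step-level log
series.append(0, step=6)  # first epoch-end log at step 6
series.append(0, step=6)  # second log at the same step: rejected

print(series.points)     # [(4, 0), (6, 0)]
print(series.errors[0])
```

In this sketch the rejected duplicate carries the same value as the point already stored, which would be consistent with the curve in the UI still looking correct.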