Lightning-AI/pytorch-lightning

Downloading artifacts with wandblogger in DDP case failing on non-zero rank processes

Open

#19003 opened on Nov 15, 2023

Labels: bug, help wanted, logger: wandb, ver: 2.1.x

Description

Bug description

When WandbLogger.download_artifact is called in a DDP run with multiple GPUs, the artifact is downloaded only in the rank-0 process. The method is wrapped with the rank_zero_only decorator, so on every other rank the method body is never executed and the call returns None instead of the download path.
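The effect of the decorator can be illustrated with a minimal stand-in. This is a simplified sketch, not Lightning's actual implementation: `rank_zero_only_sketch`, `fake_download_artifact`, and `path_seen_by_rank` are hypothetical names, and reading the rank from the `LOCAL_RANK` environment variable is an assumption made for the illustration.

```python
import functools
import os
from typing import Any, Callable, Optional


def rank_zero_only_sketch(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Simplified stand-in for a rank-zero guard (not Lightning's code)."""

    @functools.wraps(fn)
    def wrapped(*args: Any, **kwargs: Any) -> Optional[Any]:
        # Assumption for this sketch: the process rank is read from LOCAL_RANK.
        rank = int(os.environ.get("LOCAL_RANK", "0"))
        if rank == 0:
            return fn(*args, **kwargs)
        # On every other rank the wrapped function is skipped entirely.
        return None

    return wrapped


@rank_zero_only_sketch
def fake_download_artifact(name: str) -> str:
    """Hypothetical download that only builds a path; no network access."""
    return os.path.join("artifacts", name)


def path_seen_by_rank(rank: int, name: str = "model:v0") -> Optional[str]:
    """Return what a process with the given rank would get back."""
    previous = os.environ.get("LOCAL_RANK")
    os.environ["LOCAL_RANK"] = str(rank)
    try:
        return fake_download_artifact(name)
    finally:
        # Restore the environment so the helper has no lasting side effects.
        if previous is None:
            del os.environ["LOCAL_RANK"]
        else:
            os.environ["LOCAL_RANK"] = previous
```

Calling `path_seen_by_rank(0)` yields a path, while `path_seen_by_rank(1)` yields None, which mirrors the behavior reported here.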

What version are you seeing the problem on?

v2.1

How to reproduce the bug

import os

import lightning as L
from lightning.pytorch.loggers import WandbLogger
from torch import utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

logger = WandbLogger()

# Replace the placeholder with the name of an existing W&B artifact.
artifact_path = logger.download_artifact("<artifact_name>")

trainer = L.Trainer(logger=logger, accelerator="gpu", devices=[0, 1, 2])

dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_dataloader = utils.data.DataLoader(dataset)

# MyLightningModule stands for any LightningModule; its definition is omitted here.
model = MyLightningModule()
trainer.fit(model, train_dataloader)

# On rank 0 this prints the real download path; on every other rank it prints None.
print(artifact_path)
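A possible workaround, sketched under assumptions: bypass the logger method and call the wandb public API directly, so that no rank guard is involved. The helper name `download_artifact_on_every_rank` and its `api_factory` parameter are hypothetical and exist only for this illustration; `wandb.Api().artifact(...)` and `Artifact.download(root=...)` are the wandb calls it relies on. This is not a fix endorsed by the maintainers, and it has not been verified against this exact environment.

```python
from typing import Any, Callable, Optional


def download_artifact_on_every_rank(
    artifact_name: str,
    save_dir: Optional[str] = None,
    api_factory: Optional[Callable[[], Any]] = None,
) -> str:
    """Download a W&B artifact without a rank guard (hypothetical helper).

    `api_factory` is an assumption of this sketch: it lets a caller or a
    test substitute the object that provides `.artifact(name)`.
    """
    if api_factory is None:
        # Imported lazily so this module can be loaded without wandb installed.
        import wandb

        api_factory = wandb.Api
    artifact = api_factory().artifact(artifact_name)
    # Artifact.download returns the local directory the files were written to.
    return artifact.download(root=save_dir)
```

Every process that calls this helper gets a real path back. If several ranks on the same node download into the same directory at once, the writes may overlap, so giving each rank its own `save_dir` may be safer.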

Error messages and logs

# Error messages and logs here please

Environment

  • CUDA: - GPU: - NVIDIA A10G - NVIDIA A10G - NVIDIA A10G - NVIDIA A10G - available: True - version: 12.1
  • Lightning: - lightning: 2.1.0 - lightning-utilities: 0.9.0 - pytorch-lightning: 2.1.0 - torch: 2.1.0 - torchmetrics: 1.2.0 - torchvision: 0.16.0
  • Packages: - aiohttp: 3.8.6 - aiosignal: 1.3.1 - annotated-types: 0.6.0 - anyio: 4.0.0 - appdirs: 1.4.4 - argon2-cffi: 23.1.0 - argon2-cffi-bindings: 21.2.0 - arrow: 1.3.0 - asttokens: 2.4.1 - async-lru: 2.0.4 - async-timeout: 4.0.3 - attrs: 23.1.0 - babel: 2.13.1 - beautifulsoup4: 4.12.2 - bleach: 6.1.0 - bokeh: 3.3.0 - boto3: 1.28.79 - botocore: 1.31.79 - certifi: 2023.7.22 - cffi: 1.16.0 - charset-normalizer: 3.3.2 - click: 8.1.7 - colorcet: 3.0.1 - comm: 0.2.0 - compress-pickle: 2.1.0 - contourpy: 1.2.0 - cython: 0.29.36 - datasets: 2.14.6 - debugpy: 1.8.0 - decorator: 5.1.1 - defusedxml: 0.7.1 - dill: 0.3.7 - docker-pycreds: 0.4.0 - executing: 2.0.1 - fastjsonschema: 2.18.1 - filelock: 3.13.1 - fqdn: 1.5.1 - frozenlist: 1.4.0 - fsspec: 2023.10.0 - gitdb: 4.0.11 - gitpython: 3.1.40 - hdbscan: 0.8.33 - holoviews: 1.18.0 - huggingface-hub: 0.17.3 - idna: 3.4 - ipykernel: 6.26.0 - ipython: 8.17.2 - isoduration: 20.11.0 - jedi: 0.19.1 - jinja2: 3.1.2 - jmespath: 1.0.1 - joblib: 1.3.2 - json5: 0.9.14 - jsonpointer: 2.4 - jsonschema: 4.19.2 - jsonschema-specifications: 2023.7.1 - jupyter-client: 8.6.0 - jupyter-core: 5.5.0 - jupyter-events: 0.9.0 - jupyter-lsp: 2.2.0 - jupyter-server: 2.10.0 - jupyter-server-terminals: 0.4.4 - jupyterlab: 4.0.8 - jupyterlab-pygments: 0.2.2 - jupyterlab-server: 2.25.0 - lightning: 2.1.0 - lightning-utilities: 0.9.0 - linkify-it-py: 2.0.2 - llvmlite: 0.41.1 - lz4: 4.3.2 - markdown: 3.5.1 - markdown-it-py: 3.0.0 - markupsafe: 2.1.3 - matplotlib-inline: 0.1.6 - mdit-py-plugins: 0.4.0 - mdurl: 0.1.2 - mistune: 3.0.2 - mpmath: 1.3.0 - multidict: 6.0.4 - multiprocess: 0.70.15 - nbclient: 0.9.0 - nbconvert: 7.11.0 - nbformat: 5.9.2 - nest-asyncio: 1.5.8 - networkx: 3.2.1 - nltk: 3.8.1 - notebook: 7.0.6 - notebook-shim: 0.2.3 - numba: 0.58.1 - numpy: 1.26.1 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - 
nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.18.1 - nvidia-nvjitlink-cu12: 12.3.52 - nvidia-nvtx-cu12: 12.1.105 - overrides: 7.4.0 - packaging: 23.2 - pandas: 2.1.2 - pandocfilters: 1.5.0 - panel: 1.3.1 - param: 2.0.0 - parso: 0.8.3 - pexpect: 4.8.0 - pillow: 10.1.0 - pip: 23.3 - platformdirs: 3.11.0 - prometheus-client: 0.18.0 - prompt-toolkit: 3.0.39 - protobuf: 4.25.0 - psutil: 5.9.6 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - pyarrow: 14.0.0 - pycparser: 2.21 - pyct: 0.5.0 - pydantic: 2.4.2 - pydantic-core: 2.10.1 - pydantic-numpy: 4.0.0 - pygments: 2.16.1 - pynndescent: 0.5.10 - python-dateutil: 2.8.2 - python-json-logger: 2.0.7 - pytorch-lightning: 2.1.0 - pytz: 2023.3.post1 - pyviz-comms: 3.0.0 - pyyaml: 6.0.1 - pyzmq: 25.1.1 - quiver: 0.0.2 - referencing: 0.30.2 - regex: 2023.10.3 - requests: 2.31.0 - requests-file: 1.5.1 - rfc3339-validator: 0.1.4 - rfc3986-validator: 0.1.1 - rpds-py: 0.12.0 - ruamel.yaml: 0.17.40 - ruamel.yaml.clib: 0.2.8 - s3transfer: 0.7.0 - safetensors: 0.4.0 - scikit-learn: 1.3.2 - scipy: 1.11.3 - semver: 3.0.2 - send2trash: 1.8.2 - sentence-transformers: 2.2.2 - sentencepiece: 0.1.99 - sentry-sdk: 1.34.0 - setproctitle: 1.3.3 - setuptools: 68.0.0 - six: 1.16.0 - smmap: 5.0.1 - sniffio: 1.3.0 - soupsieve: 2.5 - stack-data: 0.6.3 - sympy: 1.12 - tbb: 2021.10.0 - terminado: 0.17.1 - threadpoolctl: 3.2.0 - tinycss2: 1.2.1 - tldextract: 5.1.0 - tokenizers: 0.14.1 - torch: 2.1.0 - torchmetrics: 1.2.0 - torchvision: 0.16.0 - tornado: 6.3.3 - tqdm: 4.66.1 - traitlets: 5.13.0 - transformers: 4.35.0 - triton: 2.1.0 - types-python-dateutil: 2.8.19.14 - typing-extensions: 4.8.0 - tzdata: 2023.3 - uc-micro-py: 1.0.2 - umap-learn: 0.5.4 - uri-template: 1.3.0 - urllib3: 2.0.7 - wandb: 0.16.0 - wcwidth: 0.2.9 - webcolors: 1.13 - webencodings: 0.5.1 - websocket-client: 1.6.4 - wheel: 0.41.2 - xgboost: 2.0.1 - xxhash: 3.4.1 - xyzservices: 
2023.10.1 - yarl: 1.9.2
  • System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.11.5 - release: 6.2.0-1015-aws - version: #15~22.04.1-Ubuntu SMP Fri Oct 6 21:37:24 UTC 2023

More info

No response

cc @awaelchli @morganmcg1 @borisdayma @scottire @parambharat
