skypilot-org/skypilot

[Spot] An option for keeping failed spot job for a while before termination

Open

#1163 opened on Sep 11, 2022

P1, feature-request, help wanted

Description

When something goes wrong with a spot job, it would be nice to be able to log into the spot cluster and look at the problem before the cluster is terminated. As proposed by @lhqing, an option such as --keep-minutes-after-error 60 for the spot launch would be useful for debugging.
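To make the request concrete, here is a minimal sketch of the intended behavior, not SkyPilot's actual implementation. The function name `teardown_after_grace` and its parameters are hypothetical, and the `terminate` callback stands in for whatever the spot controller uses to tear down the cluster. The only idea taken from the proposal is the grace period in minutes that applies when the job has failed.

```python
import time
from typing import Callable


def teardown_after_grace(
    job_failed: bool,
    keep_minutes_after_error: float,
    terminate: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Terminate the cluster, keeping it alive first if the job failed.

    All names here are hypothetical. Returns the number of seconds
    waited before `terminate` was called.
    """
    # A grace period applies only to failed jobs; negative values are
    # treated as zero.
    if job_failed:
        grace_seconds = max(0.0, keep_minutes_after_error) * 60
    else:
        grace_seconds = 0.0

    if grace_seconds > 0:
        # The cluster stays up during this window so a user can log in.
        sleep(grace_seconds)

    terminate()
    return grace_seconds
```

With `keep_minutes_after_error=60`, a failed job would leave the cluster reachable for one hour before teardown, while a successful job would be torn down immediately. A real implementation would probably need the wait to be cancellable so that a user can release the cluster early.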
