All jobs: 'Still waiting for job to be scheduled...'

janjongboom · January 20, 2023, 8:45am

Hi,

We’re currently having issues with the job cluster. We’re actively investigating, but all jobs will remain stuck on the above message until we’ve fixed this.

janjongboom · January 20, 2023, 9:17am

Update: the underlying issue is that our network shares are not being mounted into the job nodes, and thus jobs cannot be started. We use these network shares to communicate with jobs (e.g. for training we mount in the features, and the trained model is written to the network share when finished).

janjongboom · January 20, 2023, 9:44am

The issue has been resolved. We’ll kill all pending jobs, so please retry them manually.

Some background on the issue: yesterday we changed the security group configuration for the pods managing our network shares, and while changing the ingress rules we did not re-add the subnet on which jobs run. This caused the network mounts to fail whenever new nodes were added to the job cluster (which happens constantly, as we autoscale job nodes up and down).
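For anyone wanting to guard against a similar mistake, here is a minimal sketch of a check that every required subnet is still covered by an ingress allow-list before a security group change is applied. It is not our actual tooling: the function name `uncovered_subnets` and all CIDR ranges below are hypothetical, and it assumes the ingress rules can be expressed as a plain list of CIDR strings.

```python
import ipaddress


def uncovered_subnets(required, allowed):
    """Return the required CIDRs that no single allowed CIDR fully contains.

    Both arguments are lists of CIDR strings. This is a simplification:
    a required subnet that is only covered by the union of several
    allowed ranges is still reported as uncovered.
    """
    allowed_nets = [ipaddress.ip_network(cidr) for cidr in allowed]
    missing = []
    for cidr in required:
        net = ipaddress.ip_network(cidr)
        covered = any(
            # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first
            net.version == rule.version and net.subnet_of(rule)
            for rule in allowed_nets
        )
        if not covered:
            missing.append(cidr)
    return missing


if __name__ == "__main__":
    # Hypothetical values: the job subnet is absent from the new ingress rules.
    job_subnets = ["10.0.2.0/24"]
    new_ingress_rules = ["10.0.1.0/24", "10.0.3.0/24"]
    print(uncovered_subnets(job_subnets, new_ingress_rules))
```

Running a check like this as part of a change review would flag a dropped subnet before new nodes start failing their mounts.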

norik.badalyan · January 20, 2023, 9:47am

Thanks!! Everything works.