Hi,
We’re currently having issues with the job cluster. We’re actively investigating, but all jobs will remain stale with the above message until we’ve fixed this.
Update: the underlying issue is that our network shares are not being mounted on the job nodes, so jobs cannot be started. We use these network shares to communicate with jobs (e.g. for training we mount the features into the job, and the trained model is written to the network share when the job finishes).
The issue has been resolved. We’ll kill all pending jobs, so please retry them by hand.
Some background on the issue: yesterday we changed the security group configuration for the pods that manage our network shares, and while changing the ingress rules we did not re-add the subnet on which jobs run. This caused the network mounts to fail whenever new nodes were added to the job cluster (which happens constantly, as we autoscale job nodes up and down).
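For anyone curious what a guard against this kind of mistake could look like, here is a minimal sketch, not our actual tooling. It assumes the ingress rules and the subnets that need access are available as CIDR strings; the function name `uncovered_subnets` and all addresses below are hypothetical and chosen only for illustration.

```python
import ipaddress


def uncovered_subnets(ingress_cidrs, required_subnets):
    """Return the required subnets that no ingress CIDR fully contains."""
    allowed = [ipaddress.ip_network(cidr) for cidr in ingress_cidrs]
    missing = []
    for subnet in required_subnets:
        net = ipaddress.ip_network(subnet)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        covered = any(
            net.version == rule.version and net.subnet_of(rule)
            for rule in allowed
        )
        if not covered:
            missing.append(subnet)
    return missing


# Hypothetical values: the job subnet 10.0.3.0/24 was left out of the ingress rules.
ingress = ["10.0.1.0/24", "10.0.2.0/24"]
required = ["10.0.1.0/24", "10.0.3.0/24"]
print(uncovered_subnets(ingress, required))  # ['10.0.3.0/24']
```

Running a check like this before applying a security group change would flag a dropped subnet up front, rather than when the next autoscaled node fails to mount the share.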
Thanks!! everything works