action #13720
closed: Some jobs fail to shut down properly, eventually time out
Description
I noticed a few jobs on Fedora's production instance which seem to finish normally, but then don't quite manage to shut down fully, and eventually time out as incomplete two hours later:
https://openqa.fedoraproject.org/tests/34295
https://openqa.fedoraproject.org/tests/34324
are recent examples. There's roughly a two-hour gap between when they effectively finish and when they log "end time", and in the worker host's system log I can see that they hit the timeout:
Sep 14 02:16:36 qa05.qa.fedoraproject.org worker[1092]: 3683: WORKING 34295
Sep 14 04:17:07 qa05.qa.fedoraproject.org worker[1092]: max job time exceeded, aborting 00034295-fedora-Rawhide-universal-x86_64-BuildFedora-Rawhide-20160913.n.2-install_multi@64bit ...
The jobs get duplicated quite soon into the two-hour wait - I guess by the 'dead worker' check.
This causes us a real problem, because it messes up the fedmsg 'remaining job count'. Remember how I made the fedmsg plugin emit a count of remaining jobs for the compose, so we know when the last job has completed and can take action on that? Well, it seems that these stuck jobs aren't counted as being in a PENDING state - so when the last job besides the stuck ones completes, the remaining count is '0' and our bot, which sends out a summary email, fires. But then when each 'stuck' job eventually times out, it causes another fedmsg with a remaining count of '0', and the report email gets duplicated.
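For illustration, one way the consumer side could guard against the duplicate report is to send the summary at most once per compose, no matter how many 'remaining: 0' messages arrive. This is only a minimal sketch, not the actual plugin or bot code; the message shape and names below are assumptions:

# Sketch of a dedup guard for the summary-email bot (hypothetical names; the
# {"compose": ..., "remaining": ...} message shape is an assumption, not the
# real fedmsg payload): report at most once per compose, even if a stuck job
# timing out later produces another "remaining: 0" message.

seen_composes = set()

def should_send_report(msg, seen=seen_composes):
    """Return True only for the first 'remaining == 0' message per compose."""
    compose = msg.get("compose")
    if compose is None or msg.get("remaining") != 0:
        return False
    if compose in seen:
        # A stuck job timed out later and re-announced a count of 0; ignore it.
        return False
    seen.add(compose)
    return True

if __name__ == "__main__":
    messages = [
        {"compose": "Fedora-Rawhide-20160913.n.2", "remaining": 1},
        {"compose": "Fedora-Rawhide-20160913.n.2", "remaining": 0},  # real completion
        {"compose": "Fedora-Rawhide-20160913.n.2", "remaining": 0},  # stuck job timing out
    ]
    for m in messages:
        print(m, "->", "send report" if should_send_report(m) else "skip")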
This, again, is on our production instance, which is running git ee521289d30ad042506b6dff65f92282731b0d46, with https://github.com/os-autoinst/openQA/pull/802 backported, the auto-duplication disable commit reverted, and my https://github.com/os-autoinst/openQA/pull/844 patches. So it's possible this has been fixed since, but I figured I'd best report it just in case.
Updated by AdamWill almost 8 years ago
- Status changed from New to Closed
I'm pretty sure this is fixed now; I've checked a couple of our worker hosts and seen only one 'max job time exceeded' since December 2016.
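For reference, a rough sketch of that kind of check - scanning saved worker journal output for the abort message. The file argument and exact line format are assumptions based on the excerpt in the description:

# Rough sketch: list "max job time exceeded" aborts in saved worker journal
# output (e.g. the system log excerpted in the description); the file path
# passed on the command line and the exact message format are assumptions.
import re
import sys

pattern = re.compile(r"max job time exceeded, aborting (\S+)")

with open(sys.argv[1]) as log:
    aborted = [m.group(1) for line in log if (m := pattern.search(line))]

print(f"{len(aborted)} aborted job(s) found in the log:")
for job in aborted:
    print(" ", job)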