action #135803
closed
hook_scripts apparently stuck for 8h (by now back to good) size:M
Observation
shows an increase of inactive minion jobs at an alarming rate. shows 4 jobs for task "hook_script" active most of the time it seems, e.g. it showed when looking:
I triggered a retry on the last three, assuming they were stuck, but I think that was a wrong assessment.
As a consequence, many investigation jobs and similar have also not been executed.
Acceptance criteria
- AC1: has investigation job results shown
- AC2: An alert for has been considered
- AC3: "inactive" is back to normal levels
- Mitigate and monitor the situation, e.g. manually stopping/retrying any stuck minion jobs, restarting openqa-gru, etc.
- Follow the above minion job URLs to see what the state was
- Crosscheck the current state on the system, e.g. check the process table, attach strace for blocked processes, etc.
- Consider running more than 4 jobs:
# cat /usr/share/openqa/script/openqa-gru
#!/bin/sh -e
[ "$1" = "-h" ] || [ "$1" = "--help" ] && echo "Start openQA GRU service" && exit
exec "$(dirname "$0")"/openqa gru -m production run --reset-locks --jobs 4 --spare 2 --spare-min-priority 10 "$@"
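For the crosscheck of the process table suggested above, a minimal sketch could look like the following. The helper name `filter_long_running`, the 8h threshold and the search string `hook` are assumptions for illustration and not part of openQA. It expects input in the form `PID ELAPSED_SECONDS COMMAND...`, which procps-ng `ps` can produce with `ps -eo pid=,etimes=,args=`.

```shell
#!/bin/sh
# Hypothetical helper, not part of openQA: print every input line whose
# elapsed time (second field, in seconds) is at least $1 and which
# contains the fixed string $2 anywhere on the line.
# Expected input format: "PID ELAPSED_SECONDS COMMAND...".
filter_long_running() {
    min_seconds=$1
    needle=$2
    awk -v min="$min_seconds" -v needle="$needle" '
        # $2 + 0 forces a numeric comparison of the elapsed seconds
        $2 + 0 >= min && index($0, needle) > 0 { print }
    '
}

# Example invocation (assumes a ps that supports the "etimes" column):
# list processes mentioning "hook" that have been running for 8h or more.
# ps -eo pid=,etimes=,args= | filter_long_running 28800 hook
```

Any PID reported this way could then be inspected further, e.g. with `strace -p <pid>`, to see whether the process is blocked.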
Updated by okurz over 1 year ago
Since I reported the ticket and simply retriggered three minion jobs, I found that the minion job queue is rapidly decreasing again, see . Unlikely to be a coincidence.
Updated by okurz over 1 year ago
- Subject changed from hook_scripts apparently stuck for 8h to hook_scripts apparently stuck for 8h (by now back to good)
By now as visible in we are back to good. Still I would appreciate if we can investigate.
Updated by livdywan over 1 year ago
- Subject changed from hook_scripts apparently stuck for 8h (by now back to good) to hook_scripts apparently stuck for 8h (by now back to good) size:M
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Status changed from Workable to Rejected
- Assignee set to okurz
Seems to have been a one-off, good enough