action #135803

closed

hook_scripts apparently stuck for 8h (by now back to good) size:M

Added by okurz about 1 year ago. Updated about 1 year ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-09-15
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694660738324&to=1694756474277&viewPanel=19 shows the number of inactive minion jobs increasing at an alarming rate. https://openqa.suse.de/minion/jobs?state=active shows 4 jobs for the task "hook_script" that seem to be active most of the time, at least whenever I looked.

I triggered a retry on the last three, assuming they were stuck, but I think that assessment was wrong.

As a consequence, many investigation jobs and the like have not been executed.

Acceptance criteria

Suggestions

  • Mitigate and monitor the situation, e.g. manually stopping/retrying any stuck minion jobs, restarting openqa-gru, etc.
  • Follow the above minion job URLs to see what the state was
  • Crosscheck the current state on the system, e.g. check the process table, attach strace for blocked processes, etc.
  • Consider running more than 4 jobs:
# cat /usr/share/openqa/script/openqa-gru 
#!/bin/sh -e
[ "$1" = "-h" ] || [ "$1" = "--help" ] && echo "Start openQA GRU service" && exit
exec "$(dirname "$0")"/openqa gru -m production run --reset-locks --jobs 4 --spare 2 --spare-min-priority 10 "$@"
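For the suggestion to crosscheck the process table, a minimal sketch of a filter for long-running processes could look like the following. The function name `long_running` and the pattern `hook` are hypothetical and chosen for illustration only; the sketch also assumes a `ps` that supports the `etimes` column (elapsed seconds), as procps-ng does.

```shell
#!/bin/sh
# Hypothetical helper, not part of openQA.
# Reads lines of "PID ELAPSED_SECONDS ARGS..." on stdin and prints those
# whose elapsed time is at least $1 seconds and whose line contains $2.
long_running() {
    min_seconds="$1"
    pattern="$2"
    awk -v min="$min_seconds" -v pat="$pattern" \
        '$2 + 0 >= min + 0 && index($0, pat) > 0 { print }'
}

# Example (assumed pattern "hook"): list matching processes older than 1h.
# The ps header line is dropped because "ELAPSED" evaluates to 0 in awk.
# ps -eo pid,etimes,args | long_running 3600 hook
```

Any process reported this way would be a candidate for attaching `strace` before deciding whether to stop or retry the corresponding minion job.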
Actions #1

Updated by okurz about 1 year ago

  • Description updated (diff)
Actions #2

Updated by okurz about 1 year ago

After I reported the ticket and simply retriggered three minion jobs, the minion job queue started decreasing rapidly again, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694747433967&to=1694772474931&viewPanel=19 . This is unlikely to be a coincidence.

Actions #3

Updated by okurz about 1 year ago

  • Subject changed from hook_scripts apparently stuck for 8h to hook_scripts apparently stuck for 8h (by now back to good)

By now, as visible in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694722550718&to=1694799719183&viewPanel=19 , we are back to good. Still, I would appreciate it if we could investigate.

Actions #4

Updated by okurz about 1 year ago

  • Priority changed from Urgent to High
Actions #5

Updated by livdywan about 1 year ago

  • Subject changed from hook_scripts apparently stuck for 8h (by now back to good) to hook_scripts apparently stuck for 8h (by now back to good) size:M
  • Status changed from New to Workable
Actions #6

Updated by okurz about 1 year ago

  • Status changed from Workable to Rejected
  • Assignee set to okurz

This seems to have been a one-off, which is good enough.
