action #135803

Updated by okurz 8 months ago

## Observation 
 https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694660738324&to=1694756474277&viewPanel=19 shows the number of inactive minion jobs increasing at an alarming rate. https://openqa.suse.de/minion/jobs?state=active seems to show 4 active jobs for the task "hook_script" most of the time. For example, at the time of checking it showed: 

 * https://openqa.suse.de/minion/jobs?id=8485276 
 * https://openqa.suse.de/minion/jobs?id=8485208 
 * https://openqa.suse.de/minion/jobs?id=8485171 
 * https://openqa.suse.de/minion/jobs?id=8485099 

 I triggered a retry on the last three, assuming they were stuck, but I now think that assessment was wrong. 

 As a consequence, many investigation jobs and similar jobs have not been executed. 

 ## Acceptance criteria 
 * **AC1:** https://openqa.suse.de/tests/12128032#comments has investigation job results shown 
 * **AC2:** An alert for https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694660738324&to=1694756474277&viewPanel=19 has been considered 
 * **AC3:** https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1694660738324&to=1694756474277&viewPanel=19 "inactive" is back to normal levels 

 ## Suggestions 
 * Mitigate and monitor the situation, e.g. manually stopping/retrying any stuck minion jobs, restarting openqa-gru, etc. 
 * Follow the above minion job URLs to see what the state was 
 * Crosscheck the current state on the system, e.g. check the process table, attach strace for blocked processes, etc. 
 * Consider running more than 4 jobs concurrently: 

 ``` 
 # cat /usr/share/openqa/script/openqa-gru  
 #!/bin/sh -e 
 [ "$1" = "-h" ] || [ "$1" = "--help" ] && echo "Start openQA GRU service" && exit 
 exec "$(dirname "$0")"/openqa gru -m production run --reset-locks --jobs 4 --spare 2 --spare-min-priority 10 "$@" 
 ```
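
One way to experiment with a higher job limit without editing the script each time might be to make the numbers configurable. The following is only a sketch under assumptions: `OPENQA_GRU_JOBS` and `OPENQA_GRU_SPARE` are hypothetical environment variables, not existing openQA settings, and the defaults simply mirror the values in the script above. It prints the resulting command line as a dry run instead of starting the service.

```shell
#!/bin/sh -e
# Sketch only: build the openqa-gru argument list with a configurable job limit.
# OPENQA_GRU_JOBS and OPENQA_GRU_SPARE are hypothetical variables introduced
# for illustration; they are not existing openQA configuration options.
gru_args() {
    # Defaults match the current script: 4 concurrent jobs, 2 spare slots.
    echo "gru -m production run --reset-locks" \
         "--jobs ${OPENQA_GRU_JOBS:-4}" \
         "--spare ${OPENQA_GRU_SPARE:-2}" \
         "--spare-min-priority 10"
}

# Dry run: print the command line that would be executed.
echo "openqa $(gru_args)"
```

To actually start the service with these arguments, the final `echo` could be replaced by an `exec` of the `openqa` script as in the original, passing the output of `gru_args` followed by `"$@"`. Whether more concurrent "hook_script" jobs help depends on what the active jobs are blocked on, so this would best be combined with the crosschecks suggested above.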
