action #99741

Updated by okurz about 1 year ago

Minion jobs for job hooks failed silently on o3 

 ## Observation 

 Gru logs show entries like this for minion jobs: 

     Oct 01 14:25:44 ariel openqa-gru[30835]: Can't exec "/bin/sh": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Task/Job/ line 63. 

 Relevant minion jobs are shown as **finished** rather than "failed", e.g. with the following details: 

     args:
     - 1951060
     - ~
     attempts: 1
     children: []
     created: 2021-10-02T13:46:14.14573Z
     delayed: 2021-10-02T13:46:14.14573Z
     expires: ~
     finished: 2021-10-02T13:46:14.41935Z
     id: 800152
     lax: 0
     notes:
       gru_id: 17752756
       hook_cmd: env scheme=http exclude_group_regex='(Development|Open Build Service|Others|Kernel).*/.*'
       hook_rc: -1
     parents: []
     priority: -10
     queue: default
     result: Job successfully executed
     retried: ~
     retries: 0
     started: 2021-10-02T13:46:14.15145Z
     state: finished
     task: finalize_job_results
     time: 2021-10-04T13:59:22.27403Z
     worker: 744
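
To make the mismatch concrete, here is a minimal Python sketch of how such a job could be recognised from its details. It assumes the job info is available as a dictionary with a `state` field and a `notes` mapping that carries `hook_rc`, as the dump above and the `notes->'hook_rc'` expression in the suggested query indicate. The function name `is_silent_hook_failure` is hypothetical and not part of openQA or Minion.

```python
def is_silent_hook_failure(job: dict) -> bool:
    """Return True if a job is marked finished although its hook script failed.

    `job` is assumed to mirror the job details shown in this ticket:
    a `state` string and a `notes` mapping that may contain `hook_rc`.
    """
    notes = job.get("notes") or {}
    hook_rc = notes.get("hook_rc")
    # Jobs without a recorded hook_rc did not run a hook at all.
    if hook_rc is None:
        return False
    # A non-zero hook_rc (here -1) on a "finished" job is the silent failure.
    return job.get("state") == "finished" and int(hook_rc) != 0


# Values taken from the job details in this ticket.
example = {
    "id": 800152,
    "state": "finished",
    "result": "Job successfully executed",
    "notes": {"gru_id": 17752756, "hook_rc": -1},
}
print(is_silent_hook_failure(example))
```

Jobs that never ran a hook are deliberately not counted, so only jobs with a recorded non-zero `hook_rc` are flagged.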

 ## Acceptance criteria 
 * **AC1:** Alerts are received for both osd and o3 if a high (configurable?) number (or ratio) of hook scripts fail 

 ## Suggestions 
 * Read [lib/OpenQA/Task/Job/]( 
 * Put a postgres query like `select id, args, notes->'hook_rc' as hook_rc, notes->'hook_result' as hook_result, created, finished from minion_jobs where cast(notes->'hook_rc' as int) != 0 and finished >= timezone('UTC', now()) - interval '24 hour' order by finished;` into 
 * Don't query too frequently (once an hour maybe), since any query will have to be rather inefficient 
 * Consider a solution for o3 as well, e.g. an error in the log that openqa_logwarn would alert us about 
 * Optional: Add a corresponding Grafana panel with an alert and a description
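
The alert condition from AC1 could look roughly like the following Python sketch. It assumes the `hook_rc` values of all hook jobs finished in the monitored window are already available as a list, which would need a variant of the query above without the `!= 0` filter so that the total is known. The names `failure_ratio` and `should_alert` and the defaults `max_ratio=0.1` and `min_jobs=5` are hypothetical placeholders, not existing configuration.

```python
from typing import Iterable, List, Optional


def _recorded(hook_rcs: Iterable[Optional[int]]) -> List[int]:
    """Drop jobs that have no hook_rc recorded (no hook was run)."""
    return [rc for rc in hook_rcs if rc is not None]


def failure_ratio(hook_rcs: Iterable[Optional[int]]) -> float:
    """Fraction of hook runs with a non-zero return code."""
    rcs = _recorded(hook_rcs)
    if not rcs:
        return 0.0
    failed = sum(1 for rc in rcs if rc != 0)
    return failed / len(rcs)


def should_alert(
    hook_rcs: Iterable[Optional[int]],
    max_ratio: float = 0.1,  # hypothetical, configurable threshold
    min_jobs: int = 5,       # hypothetical guard against tiny samples
) -> bool:
    """Alert only if enough hooks ran and too many of them failed."""
    rcs = _recorded(hook_rcs)
    if len(rcs) < min_jobs:
        return False
    return failure_ratio(rcs) > max_ratio


# Eight successful hooks and two that could not be executed (hook_rc -1).
print(should_alert([0] * 8 + [-1, -1]))
```

Using a ratio together with a minimum sample size avoids alerting on a single failed hook during a quiet hour, which fits the suggestion to run the query only about once an hour.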