action #157666

Updated by okurz about 2 months ago

## Observation 
 No alerts are firing yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people could not access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359 

 ## Suggestions 
 * Possibly too many online worker instances again (see https://openqa.suse.de/admin/workers), as in #135122 
 * Disable, disconnect or power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by completely powering off the machines, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so apparently this was a helpful mitigation. 
 * Monitor over the next days/weeks whether we are still hitting unresponsiveness and the scheduler refusing to assign jobs. If we still see such issues, we must investigate further what else could be the problem. Otherwise we would need to block on #157669 before enabling more worker instances again. 
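
 For the monitoring suggestion, a minimal sketch of how the number of online worker instances could be tracked from a worker listing. The record layout (a `status` field with the value `dead` for offline instances) and the sample host names are assumptions made for illustration, not confirmed details of the openQA API; only the counting logic is the point here.

```python
# Assumption: each worker record is a dict with a "status" field,
# and "dead" marks an instance that is not online.
OFFLINE_STATES = {"dead"}


def count_online(workers):
    """Return how many worker records are not in an offline state."""
    return sum(1 for worker in workers if worker.get("status") not in OFFLINE_STATES)


# Hypothetical sample data, shaped like the assumed record layout.
sample = [
    {"host": "worker35", "instance": 1, "status": "idle"},
    {"host": "worker35", "instance": 2, "status": "running"},
    {"host": "worker36", "instance": 1, "status": "dead"},
]

print(count_online(sample))
```

 Comparing such a count over time against the 878 instances that were online after the mitigation would show whether unresponsiveness returns at the same load or only once more instances are enabled.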

 ## Out of scope 
 * Any significant change in implementation 

 ## Rollback steps 
 * Bring worker3[6-9] back online using IPMI commands
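
 The rollback could be scripted roughly as follows. This is only a sketch: the BMC host name suffix and the user name are placeholders, not the real values, and the script only prints the `ipmitool` invocations instead of executing them. The `-E` option makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
import re


def expand_hosts(pattern):
    """Expand a single-digit bracket range such as 'worker3[6-9]' into host names."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{digit}{suffix}" for digit in range(int(first), int(last) + 1)]


def power_on_commands(pattern, bmc_suffix="-bmc.example.org", user="ADMIN"):
    """Build one 'chassis power on' command per host.

    bmc_suffix and user are hypothetical placeholders and must be replaced
    with the real BMC naming scheme and account.
    """
    return [
        ["ipmitool", "-I", "lanplus", "-H", host + bmc_suffix,
         "-U", user, "-E", "chassis", "power", "on"]
        for host in expand_hosts(pattern)
    ]


# Print the commands for review instead of running them.
for command in power_on_commands("worker3[6-9]"):
    print(" ".join(command))
```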
