action #157666

Updated by okurz about 2 months ago

## Observation 
No alerts are firing yet. The problem was reported in https://suse.slack.com/archives/C02CANHLANP/p1711027186175969: people could not access the webUI and jobs were delayed in starting. It was also reported in https://suse.slack.com/archives/C02CANHLANP/p1711019658421359

## Suggestions
 * Maybe there are again too many online worker instances (see https://openqa.suse.de/admin/workers), as in #135122
 * Disable, disconnect, or power off some workers to reduce load on OSD -> okurz did that with worker3[6-9] by completely powering off the machines, reducing the number of online openQA worker instances from 1006 to 878, see https://openqa.suse.de/admin/workers . After that the openQA scheduler picked up new jobs again almost immediately, so this was apparently a helpful mitigation.

## Rollback steps
 * Bring worker3[6-9] back online using IPMI commands
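
The rollback could be scripted roughly as follows. This is only a sketch: the BMC hostname pattern (`<worker>-bmc.example.org`) and the `ADMIN` user are placeholders, not the real OSD values, and the helper names are made up for illustration. It only builds and prints `ipmitool` command lines; `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
def expand_workers(prefix="worker", first=36, last=39):
    """Expand the range worker3[6-9] into explicit host names."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def ipmi_power_command(bmc_host, user, action="on"):
    """Build an ipmitool argument list for a chassis power action.

    bmc_host and user are assumptions; substitute the real BMC address
    and account of each machine.
    """
    if action not in {"on", "off", "status"}:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,
        "-U", user,
        "-E",  # password is taken from the IPMI_PASSWORD environment variable
        "chassis", "power", action,
    ]


if __name__ == "__main__":
    for worker in expand_workers():
        # Hypothetical BMC naming scheme, adjust before use.
        cmd = ipmi_power_command(f"{worker}-bmc.example.org", "ADMIN")
        print(" ".join(cmd))
```

Printing the commands first, instead of executing them directly, allows a review before any machine is powered on; the printed lines can then be run manually or passed to `subprocess.run`.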
