action #65975

Updated by okurz almost 4 years ago

## Acceptance criteria 

 * **AC1:** An alert notification is sent over the usual channels when there are scheduled jobs older than a defined limit, e.g. 2 days 
 * **AC2:** An alert notification is sent over the usual channels when there are more than X scheduled jobs (not counting blocked) while fewer than Y jobs are running, e.g. X = 500, Y = 20 
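
 The two criteria can be read as simple threshold checks. The following is only a minimal sketch of that logic, not the actual monitoring setup: `QueueSnapshot`, `check_alerts` and all parameter names are hypothetical, and how the numbers are obtained from openQA or fed into the alerting channels is left open. The defaults mirror the example values above (2 days, X = 500, Y = 20).

 ```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class QueueSnapshot:
    """Hypothetical summary of the job queue at one point in time."""
    oldest_scheduled_at: Optional[datetime]  # creation time of the oldest scheduled job, None if there is none
    scheduled: int  # scheduled jobs, not counting blocked ones
    running: int    # jobs currently running


def check_alerts(
    snapshot: QueueSnapshot,
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # AC1 limit (example value)
    max_scheduled: int = 500,                # AC2 "X" (example value)
    min_running: int = 20,                   # AC2 "Y" (example value)
) -> List[str]:
    """Return the list of violated criteria; an empty list means no alert."""
    reasons = []
    # AC1: the oldest scheduled job has been waiting longer than the limit.
    if snapshot.oldest_scheduled_at is not None:
        if now - snapshot.oldest_scheduled_at > max_age:
            reasons.append("AC1: scheduled job older than limit")
    # AC2: many jobs are waiting while hardly any are being executed.
    if snapshot.scheduled > max_scheduled and snapshot.running < min_running:
        reasons.append("AC2: too many scheduled jobs, too few running")
    return reasons


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stuck = QueueSnapshot(now - timedelta(days=3), scheduled=600, running=5)
    for reason in check_alerts(stuck, now):
        print(reason)
 ```

 Combining "many scheduled" with "few running" in AC2 is meant to avoid alerting on a queue that is long but still being worked on; the situation described below, where workers could not connect at all, would trip this check once enough jobs pile up.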


 ## Original 

 This morning, after the deployment, I got pinged in RC that the workers no longer execute jobs. Looking at the logs I saw: 

 ``` 
 Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registering with openQA openqa.suse.de 
 Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1070 
 Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1070 
 Apr 22 10:59:11 openqaworker2 worker[21179]: [warn] [pid:21179] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds 
 ``` 

 This sequence repeated continuously. We "fixed" the problem by downgrading from perl-Mojolicious-8.37 to perl-Mojolicious-8.36 on OSD and restarting the openqa-websockets service.
