Project

General

Profile

coordination #135122

Updated by livdywan about 1 year ago

## Observation 
 From user reports in https://suse.slack.com/archives/C02CANHLANP/p1693798894476509 we learned that OSD openQA is not running the expected number of openQA jobs. https://openqa.suse.de/tests/ shows >3k jobs scheduled, only a single one in "running". No alerts known about it yet. 

 https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1693721199895&to=1693816630583 

 Maybe has to do with "max_running_jobs" and our artificial job limit? 

 From osd:/var/log/openqa_scheduler it seems the last time we had been exceeding "max_running_jobs" was yesterday morning: 

 ``` 
 [2023-09-03T06:42:55.629721+02:00] [debug] [pid:1477] max_running_jobs (220) exceeded, scheduling no additional jobs 
 ``` 

 but eventually no new jobs were scheduled: 

 ``` 
 [2023-09-03T06:37:08.921368+02:00] [debug] [pid:1477] Scheduler took 15.07441s to perform operations and allocated 80 jobs 
 [2023-09-03T06:38:25.471240+02:00] [debug] [pid:1477] max_running_jobs (220) exceeded, scheduling no additional jobs 
 [2023-09-03T06:39:55.414995+02:00] [debug] [pid:1477] max_running_jobs (220) exceeded, scheduling no additional jobs 
 [2023-09-03T06:41:34.751180+02:00] [debug] [pid:1477] Scheduler took 10.90210s to perform operations and allocated 36 jobs 
 [2023-09-03T06:42:55.629721+02:00] [debug] [pid:1477] max_running_jobs (220) exceeded, scheduling no additional jobs 
 [2023-09-03T06:44:30.842675+02:00] [debug] [pid:1477] Scheduler took 6.99158s to perform operations and allocated 12 jobs 
 [2023-09-03T06:45:58.842208+02:00] [debug] [pid:1477] Scheduler took 4.99004s to perform operations and allocated 1 jobs 
 [2023-09-03T06:47:38.480906+02:00] [debug] [pid:1477] Scheduler took 14.62729s to perform operations and allocated 7 jobs 
 [2023-09-03T06:49:09.720436+02:00] [debug] [pid:1477] Scheduler took 20.26448s to perform operations and allocated 2 jobs 
 [2023-09-03T06:49:24.279251+02:00] [debug] [pid:1477] Scheduler took 14.54522s to perform operations and allocated 2 jobs 
 [2023-09-03T06:49:29.162836+02:00] [debug] [pid:1477] Scheduler took 4.86909s to perform operations and allocated 5 jobs 
 [2023-09-03T06:49:33.687399+02:00] [debug] [pid:1477] Scheduler took 4.51766s to perform operations and allocated 1 jobs 
 [2023-09-03T06:52:25.667024+02:00] [debug] [pid:1477] Scheduler took 81.97759s to perform operations and allocated 20 jobs 
 [2023-09-03T06:52:41.096276+02:00] [debug] [pid:1477] Scheduler took 7.40601s to perform operations and allocated 2 jobs 
 [2023-09-03T06:54:13.828477+02:00] [debug] [pid:1477] Scheduler took 10.13754s to perform operations and allocated 1 jobs 
 [2023-09-03T06:55:48.446219+02:00] [debug] [pid:1477] Scheduler took 14.75419s to perform operations and allocated 10 jobs 
 [2023-09-03T06:57:21.131280+02:00] [debug] [pid:1477] Scheduler took 17.43828s to perform operations and allocated 2 jobs 
 [2023-09-03T06:58:42.071371+02:00] [debug] [pid:1477] Scheduler took 8.37779s to perform operations and allocated 6 jobs 
 [2023-09-03T06:59:49.418522+02:00] [debug] [pid:1477] Scheduler took 16.24076s to perform operations and allocated 2 jobs 
 [2023-09-03T06:59:54.048626+02:00] [debug] [pid:1477] Scheduler took 4.61859s to perform operations and allocated 2 jobs 
 [2023-09-03T07:01:29.409578+02:00] [debug] [pid:1477] Scheduler took 5.35949s to perform operations and allocated 0 jobs 
 [2023-09-03T07:03:19.697510+02:00] [debug] [pid:1477] Scheduler took 25.64661s to perform operations and allocated 6 jobs 
 [2023-09-03T07:04:47.643066+02:00] [debug] [pid:1477] Scheduler took 23.59141s to perform operations and allocated 8 jobs 
 [2023-09-03T07:06:07.213855+02:00] [debug] [pid:1477] Scheduler took 13.16123s to perform operations and allocated 3 jobs 
 [2023-09-03T07:07:47.049396+02:00] [debug] [pid:1477] Scheduler took 22.99531s to perform operations and allocated 1 jobs 
 [2023-09-03T07:08:24.139019+02:00] [debug] [pid:1477] Scheduler took 9.99626s to perform operations and allocated 2 jobs 
 [2023-09-03T07:08:29.208727+02:00] [debug] [pid:1477] Scheduler took 5.04446s to perform operations and allocated 0 jobs 
 [2023-09-03T07:10:05.727911+02:00] [debug] [pid:1477] Scheduler took 6.51611s to perform operations and allocated 4 jobs 
 [2023-09-03T07:11:41.546293+02:00] [debug] [pid:1477] Scheduler took 12.33484s to perform operations and allocated 6 jobs 
 [2023-09-03T07:13:13.093861+02:00] [debug] [pid:1477] Scheduler took 33.87103s to perform operations and allocated 5 jobs 
 [2023-09-03T07:13:20.125107+02:00] [debug] [pid:1477] Scheduler took 6.93538s to perform operations and allocated 8 jobs 
 [2023-09-03T07:15:06.484994+02:00] [debug] [pid:1477] Scheduler took 16.35689s to perform operations and allocated 3 jobs 
 [2023-09-03T07:16:40.364063+02:00] [debug] [pid:1477] Scheduler took 20.23532s to perform operations and allocated 2 jobs 
 [2023-09-03T07:16:48.671665+02:00] [debug] [pid:1477] Scheduler took 8.29525s to perform operations and allocated 4 jobs 
 [2023-09-03T07:17:25.439549+02:00] [debug] [pid:1477] Scheduler took 36.75237s to perform operations and allocated 2 jobs 
 [2023-09-03T07:20:04.238661+02:00] [debug] [pid:1477] Scheduler took 74.66060s to perform operations and allocated 10 jobs 
 [2023-09-03T07:20:50.585632+02:00] [debug] [pid:1477] Scheduler took 46.23221s to perform operations and allocated 3 jobs 
 [2023-09-03T07:22:25.632465+02:00] [debug] [pid:1477] Scheduler took 5.04418s to perform operations and allocated 0 jobs 
 [2023-09-03T07:24:07.989898+02:00] [debug] [pid:1477] Scheduler took 17.39979s to perform operations and allocated 2 jobs 
 [2023-09-03T07:24:56.537013+02:00] [debug] [pid:1477] Scheduler took 38.60689s to perform operations and allocated 3 jobs 
 [2023-09-03T07:25:10.040476+02:00] [debug] [pid:1477] Scheduler took 13.48023s to perform operations and allocated 7 jobs 
 [2023-09-03T07:27:56.371014+02:00] [debug] [pid:1477] Scheduler took 76.32781s to perform operations and allocated 2 jobs 
 [2023-09-03T07:28:15.311355+02:00] [debug] [pid:1477] Scheduler took 5.26678s to perform operations and allocated 0 jobs 
 [2023-09-03T07:29:45.190681+02:00] [debug] [pid:1477] Scheduler took 5.14464s to perform operations and allocated 0 jobs 
 [2023-09-03T07:31:17.581632+02:00] [debug] [pid:1477] Scheduler took 7.53423s to perform operations and allocated 1 jobs 
 [2023-09-03T07:32:54.416454+02:00] [debug] [pid:1477] Scheduler took 14.36776s to perform operations and allocated 1 jobs 
 [2023-09-03T07:33:22.111923+02:00] [debug] [pid:1477] Scheduler took 22.00829s to perform operations and allocated 4 jobs 
 [2023-09-03T07:33:26.993713+02:00] [debug] [pid:1477] Scheduler took 4.86527s to perform operations and allocated 2 jobs 
 [2023-09-03T07:35:09.693114+02:00] [debug] [pid:1477] Scheduler took 12.69689s to perform operations and allocated 1 jobs 
 [2023-09-03T07:36:32.347401+02:00] [debug] [pid:1477] Scheduler took 5.35075s to perform operations and allocated 0 jobs 
 [2023-09-03T07:38:02.165366+02:00] [debug] [pid:1477] Scheduler took 5.16799s to perform operations and allocated 0 jobs 
 [2023-09-03T07:39:32.304272+02:00] [debug] [pid:1477] Scheduler took 5.30574s to perform operations and allocated 0 jobs 
 [2023-09-03T07:41:01.816015+02:00] [debug] [pid:1477] Scheduler took 4.81563s to perform operations and allocated 0 jobs 
 [2023-09-03T07:42:32.059338+02:00] [debug] [pid:1477] Scheduler took 5.05789s to perform operations and allocated 0 jobs 
 [2023-09-03T07:43:21.293300+02:00] [debug] [pid:1477] Scheduler took 4.78223s to perform operations and allocated 0 jobs 
 [2023-09-03T07:43:25.971259+02:00] [debug] [pid:1477] Scheduler took 4.66514s to perform operations and allocated 0 jobs 
 [2023-09-03T07:43:32.419474+02:00] [debug] [pid:1477] Scheduler took 6.44357s to perform operations and allocated 0 jobs 
 [2023-09-03T07:45:07.570593+02:00] [debug] [pid:1477] Scheduler took 5.14955s to perform operations and allocated 0 jobs 
 … 
 ``` 

 I restarted both openqa-websockets as well as openqa-scheduler but to no effect. 

 ## Acceptance criteria 
 * **AC1:** A situation like this raises an alert 
 * **AC2:** New jobs are scheduled while max_running_jobs is clearly not exceeded and scheduled jobs exist 

 ## Rollback 
 - Undo #note-18 and #note-20 

 ## Suggestions 
 - Monitor https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&viewPanel=9&from=1694043361041&to=1694062844795 (click Running)

Back