action #179317

closed

coordination #161414: [epic] Improved salt based infrastructure management

[osd] openQA does not pick up most scheduled jobs, sluggish start, multiple failures on 2025-03-21

Added by waynechen55 13 days ago. Updated 9 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2025-03-21
Due date:
% Done:

0%

Estimated time:

Description

Observation

There are many scheduled test runs and, at the same time, many idle ipmi workers, but these workers do not pick up the scheduled jobs.


Steps to reproduce

  • Schedule test runs with ipmi workers
  • Wait for them to be picked up by idle workers

Impact

All test runs cannot start

Problem

Looks like something is wrong with the workers

Suggestions

  • Check worker states
  • Check worker machines

Workaround

n/a


Files

scheduled_jobs.png (60.5 KB) scheduled_jobs.png waynechen55, 2025-03-21 03:11
idle_ipmi_workers.png (119 KB) idle_ipmi_workers.png waynechen55, 2025-03-21 03:11
37_jobs_running.png (31.2 KB) 37_jobs_running.png waynechen55, 2025-03-21 06:40

Related issues 1 (1 open, 0 closed)

Related to openQA Infrastructure (public) - action #176175: [alert] Grafana failed to start due to corrupted config file · Blocked · okurz · 2025-01-26

Actions #1

Updated by okurz 13 days ago

  • Category set to Support
  • Assignee set to okurz
  • Target version set to Ready

Hi, did you check on https://openqa.suse.de/tests how many jobs are running? I assume OSD is hitting the global job limit, so only up to 330 jobs can run at a time.
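The check suggested above can be scripted. The following is a minimal sketch, assuming the standard openQA `/api/v1/jobs` route with its `state` filter; the helper names and the stand-in payload are illustrative, and the 330 limit is the value mentioned in this ticket, not a universal default.

```python
GLOBAL_JOB_LIMIT = 330  # limit value assumed in this ticket


def running_job_count(payload: dict) -> int:
    """Count jobs in a parsed /api/v1/jobs?state=running response."""
    return len(payload.get("jobs", []))


def at_global_limit(payload: dict, limit: int = GLOBAL_JOB_LIMIT) -> bool:
    """True if the instance is saturated by the global job limit."""
    return running_job_count(payload) >= limit


# Live query against OSD (requires network access), e.g.:
#   import json, urllib.request
#   payload = json.load(urllib.request.urlopen(
#       "https://openqa.suse.de/api/v1/jobs?state=running"))
payload = {"jobs": [{"id": i} for i in range(37)]}  # stand-in response
print(running_job_count(payload), at_global_limit(payload))
```

With 37 running jobs, as observed in #note-2, the instance is clearly far below the 330 limit, which pointed away from the job limit as the cause.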

Actions #2

Updated by waynechen55 13 days ago

okurz wrote in #note-1:

Hi, did you check on https://openqa.suse.de/tests how many jobs are running? I assume OSD is hitting the global job limit, so only up to 330 jobs can run at a time.

Only 37 jobs running:

Actions #3

Updated by okurz 13 days ago

  • Tags set to infra, reactive work, osd
  • Subject changed from [openQA][worker][ipmi] idle ipmi workers do not pick up scheduled jobs to [osd] openQA does not pick up most scheduled jobs, sluggish start, multiple failures on 2025-03-21
  • Status changed from New to In Progress
  • Priority changed from Normal to Urgent

Multiple other problems also observed and other user reports. Multiple services reported that postgresql only allows superuser connections. I restarted postgresql as well as openqa-{webui,scheduler,websockets,gru} and jobs seem to be picked up now. And I retried multiple jobs on https://openqa.suse.de/minion/jobs

Actions #4

Updated by okurz 13 days ago

Multiple openQA jobs were still endlessly waiting for minion jobs of task "git_clone" to be executed, but the corresponding minion jobs failed and it seems the openQA jobs were never updated. I walked over https://openqa.suse.de/minion/jobs?state=failed and retried 100 jobs. Multiple failures in obs_rsync_run are probably due to incorrect repository metadata in the openqa-sync-and-trigger configuration. I also observed that /etc/openqa/openqa.ini was again corrupted with null bytes as well as incomplete keys and sections. Recovered that from backup and restarted services again.
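The null-byte corruption described above is cheap to detect before services read the file. A minimal sketch, assuming only that the corruption shows up as NUL bytes in the INI file (as reported here); the function name is hypothetical:

```python
from pathlib import Path


def config_looks_corrupted(path) -> bool:
    """Heuristic for the corruption seen here: NUL bytes inside an INI file.
    Truncated keys/sections would additionally need a real INI parse
    (e.g. configparser) on top of this check."""
    return b"\x00" in Path(path).read_bytes()


# e.g. config_looks_corrupted("/etc/openqa/openqa.ini")
```

A check like this could run from a health-check timer so a corrupted config is caught and restored from backup before it causes the cascade of service failures seen on 2025-03-21.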

Actions #5

Updated by okurz 13 days ago

  • Related to action #176175: [alert] Grafana failed to start due to corrupted config file added
Actions #6

Updated by okurz 13 days ago

  • Category changed from Support to Regressions/Crashes
  • Parent task set to #161414
Actions #7

Updated by okurz 12 days ago

  • Due date set to 2025-04-04
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Generally the situation has stabilized. There is still a firing alert, though:

https://suse.slack.com/archives/C02AJ1E568M/p1742567771058469

Do you have an idea how we could make the alert https://monitor.qa.suse.de/alerting/grafana/d949dbae-8034-4bf4-8418-f148dfcaf89d/view recover quicker? Right now the root alert condition isn't present anymore, but the alert looks at all 5xx responses from the last 24h, meaning that it takes up to 24h to go back to normal.
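The slow recovery follows directly from the sliding-window shape of the alert query. A small sketch of the arithmetic (the timestamps and counts are illustrative, not taken from the actual monitoring data):

```python
from datetime import datetime, timedelta


def errors_in_window(error_times, now, window=timedelta(hours=24)):
    """Count 5xx responses whose timestamp falls within `window` before `now`."""
    return sum(1 for t in error_times if timedelta(0) <= now - t <= window)


# A burst of 5xx responses at t0 keeps a "5xx in last 24h" alert firing
# until the burst ages out of the window, i.e. up to 24h after the root
# cause is already gone.
t0 = datetime(2025, 3, 21, 8, 0)
burst = [t0] * 50

print(errors_in_window(burst, t0 + timedelta(hours=1)))   # burst still in window
print(errors_in_window(burst, t0 + timedelta(hours=25)))  # burst aged out
```

Shortening the window, or alerting on the recent error rate instead of a 24h total, would let the alert clear sooner at the cost of a less smoothed signal.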

Actions #8

Updated by tinita 12 days ago · Edited

https://openqa.suse.de/tests/17113937
State: scheduled, waiting for background tasks (id: 40287864, name: git_clone), created about 14 hours ago

Problem is, there is an entry in `gru_tasks` and `gru_dependencies`, but no corresponding minion job exists.

openqa=> select * from gru_tasks where id = 40287864;
    id    | taskname  |                                                         args                                                          |       run_at        | priority |      t_created      |      t_updated      
----------+-----------+-----------------------------------------------------------------------------------------------------------------------+---------------------+----------+---------------------+---------------------
 40287864 | git_clone | [{"\/var\/lib\/openqa\/share\/tests\/sle":null,"\/var\/lib\/openqa\/share\/tests\/sle\/products\/sle\/needles":null}] | 2025-03-21 00:30:20 |       10 | 2025-03-21 00:30:20 | 2025-03-21 00:30:20
(1 row)

openqa=> select * from gru_dependencies where job_id = 17113937;
  job_id  | gru_task_id 
----------+-------------
 17113937 |    40287864
(1 row)

openqa=> select * from minion_jobs where task='git_clone' and notes::varchar like '%40287864%' limit 1;
 id | args | created | delayed | finished | priority | result | retried | retries | started | state | task | worker | queue | attempts | parents | notes | expires | lax 
----+------+---------+---------+----------+----------+--------+---------+---------+---------+-------+------+--------+-------+----------+---------+-------+---------+-----
(0 rows)

There are 82 jobs with the same gru_task_id. Will delete those entries now.
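The inconsistency found by the queries above, i.e. `gru_tasks` rows whose minion job has vanished, can be expressed as a simple set difference. A sketch with illustrative ids; the function name is hypothetical and the real cleanup in this ticket was done directly in SQL:

```python
def orphaned_gru_tasks(gru_task_ids, minion_task_ids):
    """Return gru_tasks ids with no corresponding minion job -- the stale
    entries that left openQA jobs waiting forever on git_clone here."""
    return sorted(set(gru_task_ids) - set(minion_task_ids))


# id 40287864 exists in gru_tasks but has no minion job, so it is orphaned:
print(orphaned_gru_tasks([40287864, 40287001], [40287001]))
```

Such orphans could be found periodically with a `LEFT JOIN ... WHERE minion_jobs.id IS NULL` style query and cleaned up automatically instead of manually.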


edit: I did

delete from gru_dependencies where gru_task_id = 40287864;
delete from gru_tasks where id = 40287864;
Actions #9

Updated by tinita 12 days ago · Edited

I looked for more possibly waiting jobs:


will delete those entries as well.

edit: done.
There were even jobs from last year, but the older jobs were cancelled already.

Actions #10

Updated by okurz 9 days ago

  • Due date deleted (2025-04-04)
  • Status changed from Feedback to Resolved

All points covered, still good. No more related alerts and system looks stable now.
