action #112346

closed

[alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Start date: 2022-06-13
Due date:
% Done: 0%
Estimated time:

Description

Observation

See alerts on https://mailman.suse.de/mlarch/SuSE/osd-admins/2022/osd-admins.2022.06/maillist.html

[OK] openqaworker3: Download rate alert, 20:47:09, Grafana
[OK] openqaworker9: Download rate alert, 20:15:55, Grafana
[OK] openqaworker8: Download rate alert, 20:11:40, Grafana
[OK] openqaworker2: Download rate alert, 20:07:25, Grafana
[Alerting] openqaworker9: Download rate alert, 19:57:55, Grafana
[Alerting] openqaworker8: Download rate alert, 19:57:13, Grafana
[Alerting] openqaworker2: Download rate alert, 19:56:49, Grafana
[Alerting] openqaworker3: Download rate alert, 19:56:49, Grafana
[OK] Job age (scheduled) (median) alert, 13:02:25, Grafana
[Alerting] Job age (scheduled) (median) alert, 11:40:50, Grafana

Suggestions

Follow https://progress.opensuse.org/projects/qa/wiki/Tools#Process
Look at https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker3/worker-dashboard-openqaworker3?editPanel=65109&tab=alert&orgId=1
Maybe the webUI was affected since workers 2,3,8,9 were impacted

  • It looks like only individual jobs on each worker were downloading. This may again be just zypper and mirror infrastructure problems, see https://progress.opensuse.org/issues/112232 -> Ask where the monitoring for the mirroring infrastructure is. Most likely there is none, so it is again openQA tests that do the monitoring \o/ -> That cannot be the cause because this alert is about asset downloads
  • Likely only single jobs were downloading something; others could relate to the cache, so check what happened on OSD at the corresponding time
  • Create a separate ticket to handle the job age alert better, so that individual jobs stuck in the queue when the schedule is otherwise empty do not trigger alerts
  • Check which jobs were running on the workers, e.g.: select id, t_started, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker9' and t_started >= '2022-06-12T16:44:00' and t_started < '2022-06-12T22:48:00' order by t_finished;
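
The query in the last suggestion can be tried out offline with a small sketch like the one below. It is only an illustration under assumptions: it uses an in-memory SQLite database instead of the real OSD PostgreSQL database, the two-table schema is reduced to the columns named in the query, all rows are made-up sample data, and the helper names jobs_on_host and demo_connection are hypothetical. The subquery on workers is written as a JOIN, which gives the same result as long as workers.id is unique.

```python
import sqlite3

# Same selection as the query in the ticket, written with a JOIN and
# bound parameters instead of literals.
QUERY = """
SELECT j.id, j.t_started, j.t_finished, j.result, j.reason
FROM jobs j
JOIN workers w ON w.id = j.assigned_worker_id
WHERE w.host = :host
  AND j.t_started >= :since
  AND j.t_started < :until
ORDER BY j.t_finished
"""


def jobs_on_host(conn, host, since, until):
    """Return jobs started on `host` within [since, until), ordered by finish time."""
    params = {"host": host, "since": since, "until": until}
    return conn.execute(QUERY, params).fetchall()


def demo_connection():
    """Build an in-memory database with a reduced, assumed schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE workers (id INTEGER PRIMARY KEY, host TEXT);
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            t_started TEXT,
            t_finished TEXT,
            result TEXT,
            reason TEXT,
            assigned_worker_id INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO workers VALUES (?, ?)",
        [(1, "openqaworker9"), (2, "openqaworker3")],
    )
    # Made-up jobs: two inside the window on openqaworker9, one on another
    # host, one on openqaworker9 but outside the window.
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
        [
            (101, "2022-06-12T19:50:00", "2022-06-12T20:20:00", "passed", None, 1),
            (102, "2022-06-12T17:00:00", "2022-06-12T17:30:00", "failed", None, 1),
            (103, "2022-06-12T19:55:00", "2022-06-12T20:10:00", "passed", None, 2),
            (104, "2022-06-13T01:00:00", "2022-06-13T01:30:00", "passed", None, 1),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for row in jobs_on_host(
        conn, "openqaworker9", "2022-06-12T16:44:00", "2022-06-12T22:48:00"
    ):
        print(row)
```

Timestamps are stored as ISO 8601 strings here, so plain string comparison matches chronological order; against the real database the same query text would be run with actual timestamp columns.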

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #112583: [alert] "Job age" on OSD 2022-06-12 (Rejected, kraih, 2022-06-16)
