action #112346: [alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #112346

closed

[alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

kraih

Category:

Target version:

openQA Project (public) - Ready

Start date:

2022-06-13

Due date:

% Done:

Estimated time:

Tags:

alert, osd, job age, downlod rate

Description

Observation¶

See alerts on https://mailman.suse.de/mlarch/SuSE/osd-admins/2022/osd-admins.2022.06/maillist.html

[OK] openqaworker3: Download rate alert, 20:47:09, Grafana
[OK] openqaworker9: Download rate alert, 20:15:55, Grafana
[OK] openqaworker8: Download rate alert, 20:11:40, Grafana
[OK] openqaworker2: Download rate alert, 20:07:25, Grafana
[Alerting] openqaworker9: Download rate alert, 19:57:55, Grafana
[Alerting] openqaworker8: Download rate alert, 19:57:13, Grafana
[Alerting] openqaworker2: Download rate alert, 19:56:49, Grafana
[Alerting] openqaworker3: Download rate alert, 19:56:49, Grafana
[OK] Job age (scheduled) (median) alert, 13:02:25, Grafana
[Alerting] Job age (scheduled) (median) alert, 11:40:50, Grafana

Suggestions¶

Follow https://progress.opensuse.org/projects/qa/wiki/Tools#Process
Look at https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker3/worker-dashboard-openqaworker3?editPanel=65109&tab=alert&orgId=1
Maybe the webUI was affected since workers 2,3,8,9 were impacted

Looks like it was only individual jobs on each worker conducting download, maybe again just zypper and mirror infrastructure problems, see https://progress.opensuse.org/issues/112232 -> Ask where the monitoring for the mirroring infrastructure is. Most likely there is none so it's again openQA tests that do the monitoring \o/ -> can't be because this is about asset download
likely only single jobs where downloading something, others could relate to cache, so check for the corresponding time what happened on osd
Create separate ticket to handle the job age ticket better so that individual jobs stuck in the queue when the schedule is otherwise empty will not trigger alerts
Check which jobs were running on the workers, e.g.: select id, t_started, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker9' and t_started >= '2022-06-12T16:44:00' and t_started < '2022-06-12T22:48:00' order by t_finished;

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by livdywan over 2 years ago

To be clear all alerts went OK in the meantime, right? So we don't know if there's an imminent problem but we should still investigate it soon to be able to re-trace it.

Actions

Copy link

Updated by okurz over 2 years ago

Subject changed from [alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 to [alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by mkittler over 2 years ago

Tags deleted (~~alert, osd, download rate, job age~~)
Description updated (diff)
Target version deleted (~~Ready~~)

Actions

Copy link

Updated by mkittler over 2 years ago

Tags set to alert, osd, downlod rate, job age
Target version set to Ready

Actions

Copy link

Updated by kraih over 2 years ago

Assignee set to kraih

Actions

Copy link

Updated by kraih over 2 years ago

Unfortunately we only keep 48 hours of data in the cache service database on the workers, so the easy way to investigate is gone now.

Actions

Copy link

Updated by kraih over 2 years ago

Started with openqaworker3, and found a slow download. At first glance the results do look a bit suspicious, since the output is not 100% consistent. But that's related to when the worker polled the cache service for the latest status update. If the timing is bad it will only get a "...was downloaded by #18525, details are therefore unavailable here". We can probably improve that, but it's not a big issue.

https://openqa.suse.de/tests/8942209/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:10.888508+02:00] [info] [pid:15886] Downloading *-x86-*-rpi-411745.qcow2, request #18526 sent to Cache Service
[2022-06-12T16:56:28.917453+02:00] [info] [pid:15886] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18525] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#18525] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942211/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18525] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18525] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4.1 MiB/s), new cache size is 48 GiB

https://openqa.suse.de/tests/8942224/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:16.362740+02:00] [info] [pid:15640] Downloading *-x86-*-rpi-411745.qcow2, request #18529 sent to Cache Service
[2022-06-12T16:56:27.279998+02:00] [info] [pid:15640] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18525] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#18525] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942211/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18525] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18525] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4.1 MiB/s), new cache size is 48 GiB

https://openqa.suse.de/tests/8942214/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:12.708636+02:00] [info] [pid:16464] Downloading *-x86-*-rpi-411745.qcow2, request #18528 sent to Cache Service
[2022-06-12T16:56:24.553073+02:00] [info] [pid:16464] Download of *-x86-*-rpi-411745.qcow2 processed:
Asset "*-x86-*-rpi-411745.qcow2" was downloaded by #18525, details are therefore unavailable here

Afterwards there were quite a few jobs reusing the same asset, so no new download to update the download stats. And only a few hours later a job arrived that required a download again.

https://openqa.suse.de/tests/8943332/logfile?filename=autoinst-log.txt

[info] [#18980] Downloading "sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" from "http://openqa.suse.de/tests/8943332/asset/hdd/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2"
[info] [#18980] Cache size 50 GiB + needed 1.5 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#18980] Purging "/var/lib/openqa/cache/openqa.suse.de/publiccloud_12-SP5_EC2-BYOS-Updates.qcow2" because we need space for new assets, reclaiming 2.4 GiB
[info] [#18980] Size of "/var/lib/openqa/cache/openqa.suse.de/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" is 1.5 GiB, with ETag ""616f4400-5e147598a63e5""
[info] [#18980] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" successful (42 MiB/s), new cache size is 49 GiB

So it looks like the network was a bit slow around 2022-06-12T16:55:10, but the file was only slightly above the 250MB minimum for our measurements, so accuracy wasn't very good.

Actions

Copy link

Updated by kraih over 2 years ago

Looking at the historic data for openqaworker3, there were a few cases in the last 7 days where the download rate went similarly low. But it recovered again almost immediately because there were more downloads. At least for this worker the alert seems to have been bad luck.

Actions

Copy link

Updated by kraih over 2 years ago

Status changed from Workable to In Progress

Actions

Copy link

#10

Updated by kraih over 2 years ago

And on openqaworker2 it seems to have been the same situation:

https://openqa.suse.de/tests/8942213/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:12.789197+02:00] [info] [pid:10642] Downloading *-x86-*-rpi-411745.qcow2, request #6659 sent to Cache Service
[2022-06-12T16:56:28.770448+02:00] [info] [pid:10642] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#6659] Cache size of "/var/lib/openqa/cache" is 43 GiB, with limit 50 GiB
[info] [#6659] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942213/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#6659] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#6659] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4 MiB/s), new cache size is 43 GiB

Actions

Copy link

#11

Updated by kraih over 2 years ago

Same on openqaworker8:

https://openqa.suse.de/tests/8942219/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:14.662902+02:00] [info] [pid:24466] Downloading *-x86-*-rpi-411745.qcow2, request #19894 sent to Cache Service
[2022-06-12T16:56:36.364881+02:00] [info] [pid:24466] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#19894] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
[info] [#19894] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942219/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#19894] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#19894] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (3.8 MiB/s), new cache size is 50 GiB

Actions

Copy link

#12

Updated by kraih over 2 years ago

And openqaworker9 too:

https://openqa.suse.de/tests/8942210/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:10.670081+02:00] [info] [pid:3091] Downloading *-x86-*-rpi-411745.qcow2, request #18595 sent to Cache Service
[2022-06-12T16:56:44.058225+02:00] [info] [pid:3091] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18595] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
[info] [#18595] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942210/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18595] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18595] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (3.4 MiB/s), new cache size is 50 GiB

Actions

Copy link

#13

Updated by kraih over 2 years ago

So 4 workers were downloading the same asset at the same time, that should have slowed things down. The asset was also very small, making the measurement unreliable. Maybe the network was a bit slow too, but ping times look ok. And then there were only jobs using that same asset for the next few hours, so the download rate did not get updated.

Looks like bad luck to me, the alert itself seems fine. If we do want to do something, we could increase the minimum asset size for download rate updates. The current 250MB seems a bit small, maybe 500MB would be better to get more reliable data?

Actions

Copy link

#14

Updated by kraih over 2 years ago

Related to action #112583: [alert] "Job age" on OSD 2022-06-12 added

Actions

Copy link

#15

Updated by kraih over 2 years ago

Status changed from In Progress to Feedback

Not sure why the job age alert was included here, but it seems completely unrelated. So i've moved it into a separate ticket, #112583.

Actions

Copy link

#16

Updated by kraih over 2 years ago

Status changed from Feedback to Resolved

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #112346

[alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M

Observation¶

Suggestions¶

Updated by livdywan over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago

Updated by kraih over 2 years ago