Project

General

Profile

action #112346

[alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M

Added by okurz about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2022-06-13
Due date:
% Done:

0%

Estimated time:

Description

Observation

See alerts on https://mailman.suse.de/mlarch/SuSE/osd-admins/2022/osd-admins.2022.06/maillist.html

[OK] openqaworker3: Download rate alert, 20:47:09, Grafana
[OK] openqaworker9: Download rate alert, 20:15:55, Grafana
[OK] openqaworker8: Download rate alert, 20:11:40, Grafana
[OK] openqaworker2: Download rate alert, 20:07:25, Grafana
[Alerting] openqaworker9: Download rate alert, 19:57:55, Grafana
[Alerting] openqaworker8: Download rate alert, 19:57:13, Grafana
[Alerting] openqaworker2: Download rate alert, 19:56:49, Grafana
[Alerting] openqaworker3: Download rate alert, 19:56:49, Grafana
[OK] Job age (scheduled) (median) alert, 13:02:25, Grafana
[Alerting] Job age (scheduled) (median) alert, 11:40:50, Grafana

Suggestions

Follow https://progress.opensuse.org/projects/qa/wiki/Tools#Process
Look at https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker3/worker-dashboard-openqaworker3?editPanel=65109&tab=alert&orgId=1
Maybe the webUI was affected since workers 2,3,8,9 were impacted

  • Looks like it was only individual jobs on each worker conducting download, maybe again just zypper and mirror infrastructure problems, see https://progress.opensuse.org/issues/112232 -> Ask where the monitoring for the mirroring infrastructure is. Most likely there is none so it's again openQA tests that do the monitoring \o/ -> can't be because this is about asset download
  • likely only single jobs where downloading something, others could relate to cache, so check for the corresponding time what happened on osd
  • Create separate ticket to handle the job age ticket better so that individual jobs stuck in the queue when the schedule is otherwise empty will not trigger alerts
  • Check which jobs were running on the workers, e.g.: select id, t_started, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker9' and t_started >= '2022-06-12T16:44:00' and t_started < '2022-06-12T22:48:00' order by t_finished;

Related issues

Related to openQA Infrastructure - action #112583: [alert] "Job age" on OSD 2022-06-12Rejected2022-06-16

History

#1 Updated by cdywan about 2 months ago

To be clear all alerts went OK in the meantime, right? So we don't know if there's an imminent problem but we should still investigate it soon to be able to re-trace it.

#2 Updated by okurz about 2 months ago

  • Subject changed from [alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 to [alert] multiple alerts about "Download rate" and "Job age" on OSD 2022-06-12 size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler about 2 months ago

  • Tags deleted (alert, osd, download rate, job age)
  • Description updated (diff)
  • Target version deleted (Ready)

#4 Updated by mkittler about 2 months ago

  • Tags set to alert, osd, downlod rate, job age
  • Target version set to Ready

#5 Updated by kraih about 2 months ago

  • Assignee set to kraih

#6 Updated by kraih about 2 months ago

Unfortunately we only keep 48 hours of data in the cache service database on the workers, so the easy way to investigate is gone now.

#7 Updated by kraih about 2 months ago

Started with openqaworker3, and found a slow download. At first glance the results do look a bit suspicious, since the output is not 100% consistent. But that's related to when the worker polled the cache service for the latest status update. If the timing is bad it will only get a "...was downloaded by #18525, details are therefore unavailable here". We can probably improve that, but it's not a big issue.

https://openqa.suse.de/tests/8942209/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:10.888508+02:00] [info] [pid:15886] Downloading *-x86-*-rpi-411745.qcow2, request #18526 sent to Cache Service
[2022-06-12T16:56:28.917453+02:00] [info] [pid:15886] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18525] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#18525] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942211/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18525] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18525] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4.1 MiB/s), new cache size is 48 GiB

https://openqa.suse.de/tests/8942224/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:16.362740+02:00] [info] [pid:15640] Downloading *-x86-*-rpi-411745.qcow2, request #18529 sent to Cache Service
[2022-06-12T16:56:27.279998+02:00] [info] [pid:15640] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18525] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
[info] [#18525] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942211/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18525] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18525] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4.1 MiB/s), new cache size is 48 GiB

https://openqa.suse.de/tests/8942214/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:12.708636+02:00] [info] [pid:16464] Downloading *-x86-*-rpi-411745.qcow2, request #18528 sent to Cache Service
[2022-06-12T16:56:24.553073+02:00] [info] [pid:16464] Download of *-x86-*-rpi-411745.qcow2 processed:
Asset "*-x86-*-rpi-411745.qcow2" was downloaded by #18525, details are therefore unavailable here

Afterwards there were quite a few jobs reusing the same asset, so no new download to update the download stats. And only a few hours later a job arrived that required a download again.

https://openqa.suse.de/tests/8943332/logfile?filename=autoinst-log.txt

[info] [#18980] Downloading "sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" from "http://openqa.suse.de/tests/8943332/asset/hdd/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2"
[info] [#18980] Cache size 50 GiB + needed 1.5 GiB exceeds limit of 50 GiB, purging least used assets
[info] [#18980] Purging "/var/lib/openqa/cache/openqa.suse.de/publiccloud_12-SP5_EC2-BYOS-Updates.qcow2" because we need space for new assets, reclaiming 2.4 GiB
[info] [#18980] Size of "/var/lib/openqa/cache/openqa.suse.de/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" is 1.5 GiB, with ETag ""616f4400-5e147598a63e5""
[info] [#18980] Download of "/var/lib/openqa/cache/openqa.suse.de/sle-12-SP4-x86_64-Build20220612-1-HA-QAM.qcow2" successful (42 MiB/s), new cache size is 49 GiB

So it looks like the network was a bit slow around 2022-06-12T16:55:10, but the file was only slightly above the 250MB minimum for our measurements, so accuracy wasn't very good.

#8 Updated by kraih about 2 months ago

Looking at the historic data for openqaworker3, there were a few cases in the last 7 days where the download rate went similarly low. But it recovered again almost immediately because there were more downloads. At least for this worker the alert seems to have been bad luck.

#9 Updated by kraih about 2 months ago

  • Status changed from Workable to In Progress

#10 Updated by kraih about 2 months ago

And on openqaworker2 it seems to have been the same situation:

https://openqa.suse.de/tests/8942213/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:12.789197+02:00] [info] [pid:10642] Downloading *-x86-*-rpi-411745.qcow2, request #6659 sent to Cache Service
[2022-06-12T16:56:28.770448+02:00] [info] [pid:10642] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#6659] Cache size of "/var/lib/openqa/cache" is 43 GiB, with limit 50 GiB
[info] [#6659] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942213/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#6659] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#6659] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (4 MiB/s), new cache size is 43 GiB

#11 Updated by kraih about 2 months ago

Same on openqaworker8:

https://openqa.suse.de/tests/8942219/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:14.662902+02:00] [info] [pid:24466] Downloading *-x86-*-rpi-411745.qcow2, request #19894 sent to Cache Service
[2022-06-12T16:56:36.364881+02:00] [info] [pid:24466] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#19894] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
[info] [#19894] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942219/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#19894] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#19894] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (3.8 MiB/s), new cache size is 50 GiB

#12 Updated by kraih about 2 months ago

And openqaworker9 too:

https://openqa.suse.de/tests/8942210/logfile?filename=autoinst-log.txt

[2022-06-12T16:55:10.670081+02:00] [info] [pid:3091] Downloading *-x86-*-rpi-411745.qcow2, request #18595 sent to Cache Service
[2022-06-12T16:56:44.058225+02:00] [info] [pid:3091] Download of *-x86-*-rpi-411745.qcow2 processed:
[info] [#18595] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
[info] [#18595] Downloading "*-x86-*-rpi-411745.qcow2" from "http://openqa.suse.de/tests/8942210/asset/hdd/*-x86-*-rpi-411745.qcow2"
[info] [#18595] Size of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" is 297 MiB, with ETag ""12900000-5e1415b6d8177""
[info] [#18595] Download of "/var/lib/openqa/cache/openqa.suse.de/*-x86-*-rpi-411745.qcow2" successful (3.4 MiB/s), new cache size is 50 GiB

#13 Updated by kraih about 2 months ago

So 4 workers were downloading the same asset at the same time, that should have slowed things down. The asset was also very small, making the measurement unreliable. Maybe the network was a bit slow too, but ping times look ok. And then there were only jobs using that same asset for the next few hours, so the download rate did not get updated.

Looks like bad luck to me, the alert itself seems fine. If we do want to do something, we could increase the minimum asset size for download rate updates. The current 250MB seems a bit small, maybe 500MB would be better to get more reliable data?

#14 Updated by kraih about 2 months ago

  • Related to action #112583: [alert] "Job age" on OSD 2022-06-12 added

#15 Updated by kraih about 2 months ago

  • Status changed from In Progress to Feedback

Not sure why the job age alert was included here, but it seems completely unrelated. So i've moved it into a separate ticket, #112583.

#16 Updated by kraih about 2 months ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF