action #101265

coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Upgrade arm3 to Leap 15.3 and compare failure rate size:M

Added by cdywan about 2 months ago. Updated 22 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2021-10-15
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side), openqaworker-arm-4/5 have a fail-ratio of 33-36%, whereas openqaworker-arm-1/2/3 have a fail-ratio of 15-17%.

Acceptance criteria

  • AC1: arm3 is running Leap 15.3
  • AC2: arm3 fail-ratio is known from a sufficiently large set

Suggestions

sysctl_diff.html (39.3 KB): arm4 left, arm3 right (nicksinger, 2021-10-18 11:36)

Related issues

Copied from openQA Project - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 (Workable, 2021-10-15)

History

#1 Updated by cdywan about 2 months ago

  • Copied from coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 added

#2 Updated by okurz about 2 months ago

  • Tracker changed from coordination to action

#3 Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#4 Updated by mkittler about 1 month ago

  • Description updated (diff)

I assume the relevant Wiki section is the one about distribution upgrades.

#5 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

Unfortunately the repository from the IBS project https://build.suse.de/project/show/NON_Public:infrastructure isn't available for Leap 15.3. Maybe the repository can simply be deleted considering it isn't present on arm 4 and 5 which already use Leap 15.3. However, I'd like to clarify that before continuing.

#6 Updated by okurz about 1 month ago

As clarified in chat, it should be fine to go ahead without the repo or to remove it. However, it should also be fine to use https://build.suse.de/project/repository_state/NON_Public:infrastructure/SLE_15_SP3, since Leap 15.3 == SLE 15 SP3, at least regarding building packages.

#7 Updated by mkittler about 1 month ago

  • Status changed from Feedback to In Progress

Somehow I've missed the chat messages but that answer is good enough for me :-)
I'll try using the SLE_15_SP3 repo then.

#8 Updated by openqa_review about 1 month ago

  • Due date set to 2021-11-11

Setting due date based on mean cycle time of SUSE QE Tools

#9 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

Since the Grafana dashboard shows the fail ratio for all jobs, I've been running manual queries for the failure rate before and after the upgrade.

Before:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished < '2021-10-28' group by host;
        host        | ratio_failed_by_host 
--------------------+----------------------
 openqaworker-arm-1 |                17.72
 openqaworker-arm-2 |                15.04
 openqaworker-arm-3 |                 14.8
 openqaworker-arm-4 |                39.14
 openqaworker-arm-5 |                37.54
(5 rows)

After:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host 
--------------------+----------------------
 openqaworker-arm-1 |                14.81
 openqaworker-arm-2 |                11.57
 openqaworker-arm-3 |                 6.13
(3 rows)

I'll re-run the 2nd query after a few days to see how the figures change once more jobs have been processed.

(ARM 4 and 5 don't show up in the 2nd table because their worker class has been changed so they don't run jobs anymore at the moment.)
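For readers who prefer to follow the computation outside of psql, here is a minimal Python sketch of what the two queries above calculate: the percentage of failed jobs per ARM host, split at the upgrade cutoff date. The function name, the tuple layout and the sample records in `JOBS` are hypothetical and only serve as illustration; they are not taken from the openQA database.

```python
from collections import defaultdict
from datetime import date


def fail_ratio_by_host(jobs, cutoff, after=True):
    """Return {host: percentage of failed jobs}, rounded to 2 decimals.

    jobs   -- iterable of (host, result, finished) tuples (hypothetical layout)
    cutoff -- jobs finished on/after this date count as "after the upgrade"
    after  -- select the period after (True) or before (False) the cutoff
    """
    failed = defaultdict(int)
    total = defaultdict(int)
    for host, result, finished in jobs:
        # Mirror the SQL filters: skip jobs without a result and non-ARM hosts.
        if result == "none" or "-arm-" not in host:
            continue
        # Keep only jobs from the requested side of the cutoff.
        if (finished >= cutoff) != after:
            continue
        total[host] += 1
        if result == "failed":
            failed[host] += 1
    return {h: round(failed[h] * 100.0 / total[h], 2) for h in sorted(total)}


# Made-up sample records, for illustration only.
JOBS = [
    ("openqaworker-arm-3", "failed", date(2021, 10, 20)),
    ("openqaworker-arm-3", "passed", date(2021, 10, 21)),
    ("openqaworker-arm-3", "passed", date(2021, 10, 29)),
    ("openqaworker-arm-3", "passed", date(2021, 10, 30)),
    ("openqaworker-arm-3", "failed", date(2021, 10, 30)),
    ("openqaworker-arm-3", "none", date(2021, 10, 30)),
    ("openqaworker1", "failed", date(2021, 10, 30)),
]
CUTOFF = date(2021, 10, 28)

before = fail_ratio_by_host(JOBS, CUTOFF, after=False)
after_upgrade = fail_ratio_by_host(JOBS, CUTOFF, after=True)
print("before:", before)
print("after: ", after_upgrade)
```

Like the SQL, the sketch counts every job whose result is not 'none' in the denominator, so incomplete or cancelled jobs lower the ratio without counting as failures.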


Note that after updating o3 workers (not aarch64 workers specifically) to Leap 15.3, problems with the new QEMU version came up, leading to failures like https://openqa.opensuse.org/tests/1997053#step/welcome/5. I am not sure whether this issue is actually related, though.

#10 Updated by mkittler about 1 month ago

By the way, I've dug a bit into the job history of ARM 4 and 5, and the failing jobs were failing for various different reasons, which makes the problem hard to pin down. I also haven't spotted a USB boot job, so the issue mentioned in the previous comment is likely unrelated.

#11 Updated by mkittler about 1 month ago

After the upgrade, arm-3's failure rate is still pretty much the same:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host 
--------------------+----------------------
 openqaworker-arm-1 |                13.25
 openqaworker-arm-2 |                13.14
 openqaworker-arm-3 |                13.61
(3 rows)

#12 Updated by mkittler about 1 month ago

It still looks good, but so do arm-4 and arm-5 now (after kraih enabled them again):

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host 
--------------------+----------------------
 openqaworker-arm-1 |                 12.4
 openqaworker-arm-2 |                12.78
 openqaworker-arm-3 |                12.23
 openqaworker-arm-4 |                 7.14
 openqaworker-arm-5 |                11.11
(5 rows)

#13 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

I think > 2000 jobs are enough to tell that installing Leap 15.3 did not increase the failure rate on arm-3:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 11.2 |  1214
 openqaworker-arm-2 |                11.73 |  2216
 openqaworker-arm-3 |                11.28 |  2305
 openqaworker-arm-4 |                 9.09 |    88
 openqaworker-arm-5 |                 17.5 |    80

When looking for incomplete jobs the figures look similar.
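As a rough sanity check on whether the sample sizes above are "sufficiently large" (AC2), one could put a confidence interval around each failure ratio. The sketch below uses the Wilson score interval, which is my choice for illustration and not something used in this ticket. The failed-job counts are not in the query output; they are derived here by rounding ratio * total, which is an assumption.

```python
import math


def wilson_interval(failed, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = failed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# (ratio_failed_by_host in %, total) as reported in the table above.
REPORTED = {
    "openqaworker-arm-1": (11.2, 1214),
    "openqaworker-arm-2": (11.73, 2216),
    "openqaworker-arm-3": (11.28, 2305),
    "openqaworker-arm-4": (9.09, 88),
    "openqaworker-arm-5": (17.5, 80),
}

intervals = {}
for host, (ratio, total) in REPORTED.items():
    # Assumption: recover the failed-job count by rounding ratio * total.
    failed = round(ratio * total / 100)
    low, high = wilson_interval(failed, total)
    intervals[host] = (low, high)
    print(f"{host}: {failed}/{total} failed, "
          f"interval {low * 100:.1f}% to {high * 100:.1f}%")
```

Under these assumptions the interval for arm-3 is only a few percentage points wide, whereas the intervals for arm-4 and arm-5 (fewer than 100 jobs each) are several times wider, so their figures should be read with more caution.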

By the way, the kernel arm-3 has been running since the Leap 15.3 update is:

martchus@openqaworker-arm-3:~> uname -a
Linux openqaworker-arm-3 5.3.18-59.27-default #1 SMP Tue Oct 5 10:00:40 UTC 2021 (7df2404) aarch64 aarch64 aarch64 GNU/Linux

#14 Updated by okurz 22 days ago

  • Due date deleted (2021-11-11)
