action #101265
closed coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Upgrade arm3 to Leap 15.3 and compare failure rate size:M
Added by livdywan almost 3 years ago. Updated almost 3 years ago.
Description
Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side), openqaworker-arm-4/5 have a fail ratio of 33-36%, whereas openqaworker-arm-1/2/3 have a fail ratio of 15-17%.
Acceptance criteria
- AC1: arm3 is running Leap 15.3
- AC2: arm3 fail-ratio is known from a sufficiently large set
Suggestions
- Upgrade arm3 to Leap 15.3 and compare failure rate
- Read https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Distribution-upgrades
Files
sysctl_diff.html (39.3 KB) | arm4 left, arm3 right | nicksinger, 2021-10-18 11:36
Updated by livdywan almost 3 years ago
- Copied from coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 added
Updated by mkittler almost 3 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler almost 3 years ago
- Description updated
I assume the relevant Wiki section is the one about distribution upgrades.
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Feedback
Unfortunately, the repository from the IBS project https://build.suse.de/project/show/NON_Public:infrastructure isn't available for Leap 15.3. Maybe the repository can simply be removed, considering it isn't present on arm-4 and arm-5, which already use Leap 15.3. However, I'd like to clarify that before continuing.
Updated by okurz almost 3 years ago
As clarified in chat, it should be fine to go ahead without the repo or to remove it. However, it should also be fine to use https://build.suse.de/project/repository_state/NON_Public:infrastructure/SLE_15_SP3, as Leap 15.3 == SLE 15 SP3, at least regarding building packages.
Updated by mkittler almost 3 years ago
- Status changed from Feedback to In Progress
Somehow I've missed the chat messages but that answer is good enough for me :-)
I'll try using the SLE_15_SP3 repo then.
Updated by openqa_review almost 3 years ago
- Due date set to 2021-11-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Feedback
Since the Grafana dashboard only shows the fail ratio over all jobs, I ran manual queries to get the failure rate before and after the upgrade.
Before:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished < '2021-10-28' group by host;
host | ratio_failed_by_host
--------------------+----------------------
openqaworker-arm-1 | 17.72
openqaworker-arm-2 | 15.04
openqaworker-arm-3 | 14.8
openqaworker-arm-4 | 39.14
openqaworker-arm-5 | 37.54
(5 rows)
After:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
host | ratio_failed_by_host
--------------------+----------------------
openqaworker-arm-1 | 14.81
openqaworker-arm-2 | 11.57
openqaworker-arm-3 | 6.13
(3 rows)
I'll re-run the 2nd query in a few days to see how the figures change once more jobs have been processed.
(ARM 4 and 5 don't show up in the 2nd table because their worker class has been changed so they don't run jobs anymore at the moment.)
Note that after updating o3 workers (not aarch64 workers specifically) to Leap 15.3, problems with the new QEMU version came up, leading to failures like https://openqa.opensuse.org/tests/1997053#step/welcome/5. I am not sure whether this issue is actually related, though.
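For readers who want to check the aggregation outside of psql, here is a minimal Python sketch of what the queries above compute: per host, the share of failed jobs among all jobs whose result is not 'none'. The function name `failed_ratio_by_host` and the sample rows are made up for illustration; they are not taken from the openQA database.

```python
from collections import defaultdict


def failed_ratio_by_host(jobs, host_substring="-arm-"):
    """Return {host: percentage of failed jobs}, rounded to 2 decimals.

    `jobs` is an iterable of (host, result) tuples. Rows without a host,
    hosts not containing `host_substring`, and jobs whose result is 'none'
    are skipped, mirroring the filters of the SQL query.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for host, result in jobs:
        if host is None or host_substring not in host or result == "none":
            continue
        totals[host] += 1
        if result == "failed":
            failed[host] += 1
    return {
        host: round(failed[host] * 100.0 / totals[host], 2)
        for host in sorted(totals)
    }


# Hypothetical sample rows, for illustration only.
sample_jobs = [
    ("openqaworker-arm-3", "passed"),
    ("openqaworker-arm-3", "failed"),
    ("openqaworker-arm-3", "passed"),
    ("openqaworker-arm-3", "incomplete"),
    ("openqaworker-arm-4", "failed"),
    ("openqaworker-arm-4", "none"),   # not finished, ignored
    ("openqaworker1", "failed"),      # not an ARM worker, ignored
    (None, "failed"),                 # job without assigned worker, ignored
]

ratios = failed_ratio_by_host(sample_jobs)
print(ratios)
```

Note that every result other than 'failed' (including 'incomplete') still counts towards the total, exactly as in the query.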
Updated by mkittler almost 3 years ago
By the way, I've dug a bit into the job history of arm-4 and arm-5, and the failing jobs were failing for various reasons, which makes the problem hard to pin down. I also haven't spotted a USB boot job, so the issue mentioned in the previous comment is likely unrelated.
Updated by mkittler almost 3 years ago
With more jobs processed after the upgrade, arm-3's failure rate is pretty much the same as before the upgrade:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
host | ratio_failed_by_host
--------------------+----------------------
openqaworker-arm-1 | 13.25
openqaworker-arm-2 | 13.14
openqaworker-arm-3 | 13.61
(3 rows)
Updated by mkittler almost 3 years ago
It still looks good, but so do arm-4 and arm-5 now (after @kraih enabled them again):
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
host | ratio_failed_by_host
--------------------+----------------------
openqaworker-arm-1 | 12.4
openqaworker-arm-2 | 12.78
openqaworker-arm-3 | 12.23
openqaworker-arm-4 | 7.14
openqaworker-arm-5 | 11.11
(5 rows)
Updated by mkittler almost 3 years ago
- Status changed from Feedback to Resolved
I think more than 2000 jobs are enough to tell that installing Leap 15.3 did not increase the failure rate on arm-3:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
host | ratio_failed_by_host | total
--------------------+----------------------+-------
openqaworker-arm-1 | 11.2 | 1214
openqaworker-arm-2 | 11.73 | 2216
openqaworker-arm-3 | 11.28 | 2305
openqaworker-arm-4 | 9.09 | 88
openqaworker-arm-5 | 17.5 | 80
When looking at incomplete jobs instead of failed ones, the figures look similar.
By the way, the kernel arm-3 has been running since the Leap 15.3 upgrade is:
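To put "enough jobs" into numbers, one option is to attach a confidence interval to each ratio. The following is only a sketch and not something that was run as part of this ticket: it uses the Wilson score interval at roughly 95% confidence (z = 1.96), and it reconstructs the failed-job counts by rounding ratio × total from the table above, which is an assumption.

```python
import math


def wilson_interval(failed, total, z=1.96):
    """Wilson score interval for a proportion; returns (low, high) as fractions."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# (ratio in percent, total jobs) per host, copied from the table above.
table = {
    "openqaworker-arm-1": (11.2, 1214),
    "openqaworker-arm-2": (11.73, 2216),
    "openqaworker-arm-3": (11.28, 2305),
    "openqaworker-arm-4": (9.09, 88),
    "openqaworker-arm-5": (17.5, 80),
}

intervals = {}
for host, (ratio_percent, total) in table.items():
    # Assumption: the failed count is recovered by rounding ratio * total.
    failed = round(ratio_percent / 100 * total)
    low, high = wilson_interval(failed, total)
    intervals[host] = (low, high)
    print(f"{host}: {low * 100:.1f}% .. {high * 100:.1f}% (n={total})")
```

Under these assumptions the interval for arm-3 is only a few percentage points wide, while the intervals for arm-4 and arm-5 span more than ten points, so their current figures say little yet.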
martchus@openqaworker-arm-3:~> uname -a
Linux openqaworker-arm-3 5.3.18-59.27-default #1 SMP Tue Oct 5 10:00:40 UTC 2021 (7df2404) aarch64 aarch64 aarch64 GNU/Linux