action #101271
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
openQA Project - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
Try Kernel:stable on arm4+arm5 and compare failure rate size:M
0%
Description
Observation
According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%
Acceptance criteria
- AC1: arm4 or arm5 is running the Linux kernel from build.opensuse.org/project/show/Kernel:stable
- AC2: The fail-ratio is known from a sufficiently large set and compared against the previously known value (33-36%)
Suggestions
- Install kernel from build.opensuse.org/project/show/Kernel:stable
- Reboot into the new kernel
- Schedule many tests on the upgraded machine
- Gather fail ratio, e.g. follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation (or from grafana or database manually) and compare
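The "gather fail ratio and compare" step could be sketched roughly as follows. This is a minimal illustration, assuming the finished jobs have been exported as (host, result) pairs; the function name `fail_ratio_by_host` and the sample data are hypothetical and not part of openQA.

```python
from collections import Counter


def fail_ratio_by_host(jobs):
    """Return {host: (percent_failed, total)} for an iterable of (host, result) pairs.

    Jobs whose result is 'none' (not finished yet) are skipped.
    """
    totals = Counter()
    failed = Counter()
    for host, result in jobs:
        if result == "none":
            continue
        totals[host] += 1
        if result == "failed":
            failed[host] += 1
    return {
        host: (round(100.0 * failed[host] / totals[host], 2), totals[host])
        for host in totals
    }


# Made-up sample data, for illustration only
sample = [
    ("openqaworker-arm-4", "passed"),
    ("openqaworker-arm-4", "failed"),
    ("openqaworker-arm-4", "passed"),
    ("openqaworker-arm-4", "none"),
    ("openqaworker-arm-1", "passed"),
    ("openqaworker-arm-1", "passed"),
]
ratios = fail_ratio_by_host(sample)
for host, (percent, total) in sorted(ratios.items()):
    print(f"{host}: {percent}% failed of {total} jobs")
```

The resulting per-host percentages can then be compared against the previously known 33-36% for arm-4/5.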
History
#4
Updated by openqa_review 7 months ago
- Due date set to 2021-11-13
Setting due date based on mean cycle time of SUSE QE Tools
#5
Updated by mkittler 7 months ago
Looks like the newly added repo hasn't been configured with auto-refresh enabled, which leads to errors when updating, see #101779#note-5. I took the liberty of enabling auto-refresh of the repo on arm-4 and arm-5 to fix #101779. This has now actually installed the stable kernel versions (vendor switch from SUSE LLC https://www.suse.com/ to obs://build.opensuse.org/Kernel was done). I assume this was intended. I'll leave it to you to actually let the machines boot into the different kernel. Note that I've also adjusted the repository priority for the kernel repo so there's a clear configuration which packages should take precedence.
#6
Updated by kraih 7 months ago
mkittler wrote:
This has now actually installed the stable kernel versions (vendor switch from SUSE LLC https://www.suse.com/ to obs://build.opensuse.org/Kernel was done). I assume this was intended. I'll leave it to you to actually let the machines boot into the different kernel. Note that I've also adjusted the repository priority for the kernel repo so there's a clear configuration which packages should take precedence.
The machines were already running the stable kernel. I upgraded them from 5.3.18-59.27-default to 5.14.14-lp153.3.g2b5383f-default on Friday.
#10
Updated by mkittler 7 months ago
So far it looks good, see #101265#note-12 - although the number of jobs which have been executed is still rather small:
openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                 12.4 |  1008
 openqaworker-arm-2 |                12.79 |  1876
 openqaworker-arm-3 |                12.16 |  1957
 openqaworker-arm-4 |                 7.14 |    14
 openqaworker-arm-5 |                11.11 |    18
(5 rows)
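To illustrate why such a small number of jobs is not yet conclusive, here is a rough sketch that computes a 95% Wilson score interval for a fail ratio. The failure counts are reconstructed from the rounded percentages above (7.14% of 14 jobs is 1 failure, 12.4% of 1008 jobs is about 125 failures), so treat them as assumptions; the helper `wilson_interval` is hypothetical and not part of any openQA tooling.

```python
import math


def wilson_interval(failed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts reconstructed from the rounded ratios in the table (assumption)
arm4_low, arm4_high = wilson_interval(1, 14)
arm1_low, arm1_high = wilson_interval(125, 1008)

print(f"arm-4: {arm4_low:.1%} .. {arm4_high:.1%}")
print(f"arm-1: {arm1_low:.1%} .. {arm1_high:.1%}")
```

With only 14 jobs the interval for arm-4 is very wide, while the interval for arm-1 with about a thousand jobs is narrow, which matches the caution expressed above.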
#11
Updated by kraih 7 months ago
Cloned a few more jobs randomly, and it seems fine so far. Going to activate them in Salt again and keep an eye on it over the next few days.
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                12.15 |  1078
 openqaworker-arm-2 |                12.51 |  1982
 openqaworker-arm-3 |                12.04 |  2060
 openqaworker-arm-4 |                 6.38 |    47
 openqaworker-arm-5 |                11.32 |    53
#12
Updated by kraih 7 months ago
The fact that arm-4/5 are getting far fewer jobs does appear to skew the results a little bit. I assume it's because they don't have the tap class.
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-05' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                 2.22 |   135
 openqaworker-arm-2 |                 3.43 |   233
 openqaworker-arm-3 |                  4.4 |   250
 openqaworker-arm-4 |                 12.2 |    41
 openqaworker-arm-5 |                29.63 |    27
(5 rows)
#13
Updated by kraih 7 months ago
Not sure what to make of the results from the weekend.
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-05' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                 7.68 |   456
 openqaworker-arm-2 |                10.35 |   889
 openqaworker-arm-3 |                 8.49 |   931
 openqaworker-arm-4 |                41.67 |   180
 openqaworker-arm-5 |                34.97 |   183
(5 rows)
At first glance it doesn't look so good, but at the same time arm-4/5 got far fewer jobs than arm-1/2/3. Not sure we can get an actually useful comparison without setting up tap on arm-4/5.
#14
Updated by okurz 7 months ago
I suggest scheduling many more jobs, e.g. 1k-10k jobs. Don't be afraid of the load because the machines aren't used for production right now anyway. Then get the fail ratio from that, even if it's from non-multi-machine tests, i.e. without "tap". If the fail ratio is below 15%, you can either schedule multi-machine tests or, if you are careful and monitor closely, bring the machines into production and handle all unreviewed test failures quickly so as not to confuse test reviewers.
#15
Updated by kraih 7 months ago
okurz wrote:
I suggest scheduling many more jobs, e.g. 1k-10k jobs. Don't be afraid of the load because the machines aren't used for production right now anyway.
They were in production over the weekend. I'll take them out again now and start a synthetic stress test. Maybe those results will be more helpful.
#16
Updated by kraih 7 months ago
Results for the latest test so far (still running):
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                 4.67 |   150
 openqaworker-arm-2 |                 5.78 |   329
 openqaworker-arm-3 |                 6.45 |   310
 openqaworker-arm-4 |                26.23 |   122
 openqaworker-arm-5 |                22.76 |   123
(5 rows)
#18
Updated by kraih 7 months ago
Slightly higher fail rate for all arm workers today. (Experiment still ongoing)
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                10.95 |   201
 openqaworker-arm-2 |                14.61 |   479
 openqaworker-arm-3 |                11.82 |   440
 openqaworker-arm-4 |                29.72 |   212
 openqaworker-arm-5 |                23.94 |   213
(5 rows)
#19
Updated by kraih 7 months ago
And slightly lower again today. (Experiment is still ongoing)
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total
--------------------+----------------------+-------
 openqaworker-arm-1 |                 6.74 |   341
 openqaworker-arm-2 |                 9.85 |   822
 openqaworker-arm-3 |                 8.36 |   730
 openqaworker-arm-4 |                27.92 |   308
 openqaworker-arm-5 |                23.89 |   314
(5 rows)
We can probably call it after the weekend.
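As a rough way to judge whether the gap between arm-4/5 and arm-1/2/3 in the latest table is more than noise, a pooled two-proportion z-test could be used. This is only a sketch: the failure counts are reconstructed from the rounded percentages (86 of 308 and 75 of 314 for arm-4/5; 23 of 341, 81 of 822 and 61 of 730 for arm-1/2/3), the helper name is hypothetical, and the test says nothing about whether the hosts ran a comparable mix of jobs.

```python
import math


def two_proportion_z(failed_a, total_a, failed_b, total_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = failed_a / total_a
    p_b = failed_b / total_b
    pooled = (failed_a + failed_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts reconstructed from the rounded ratios (assumption)
arm45_failed, arm45_total = 86 + 75, 308 + 314
arm123_failed, arm123_total = 23 + 81 + 61, 341 + 822 + 730

z, p_value = two_proportion_z(arm45_failed, arm45_total, arm123_failed, arm123_total)
print(f"z = {z:.2f}, p = {p_value:.3g}")
```

A large positive z with a very small p-value would indicate that arm-4/5 still fail noticeably more often than arm-1/2/3 on the stable kernel, subject to the caveat about differing job mixes.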