action #101271: Try Kernel:stable on arm4+arm5 and compare failure rate size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #101271

closed

openQA Project (public) - coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Try Kernel:stable on arm4+arm5 and compare failure rate size:M

Added by okurz about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: kraih
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-10-15
Due date:
% Done:
Estimated time:

Description

Observation¶

According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side) openqaworker-arm-4/5 have a fail-ratio of 33-36% vs. openqaworker-arm-1/2/3 with a fail-ratio of 15-17%

Acceptance criteria¶

AC1: arm4 or arm5 is running the Linux kernel from build.opensuse.org/project/show/Kernel:stable
AC2: The fail-ratio is known from a sufficiently large set and compared against the previously known value (33-36%)

Suggestions¶

Install kernel from build.opensuse.org/project/show/Kernel:stable
Reboot into the new kernel
Schedule many tests on the upgraded machine
Gather fail ratio, e.g. follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation (or from grafana or database manually) and compare
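
The last suggestion, gathering the fail ratio per worker host, can be sketched roughly as follows. This is a minimal Python illustration and not the procedure from the linked wiki: the function name `fail_ratio_by_host` and the sample records are made up, and the result strings `failed` and `none` are taken from the SQL queries posted in the comments of this ticket.

```python
from collections import defaultdict


def fail_ratio_by_host(jobs):
    """Return {host: (percent_failed, total)} over finished jobs.

    `jobs` is an iterable of (host, result) pairs. Jobs whose result is
    "none" have not finished yet and are skipped.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for host, result in jobs:
        if result == "none":
            continue
        totals[host] += 1
        if result == "failed":
            failed[host] += 1
    return {
        host: (round(100.0 * failed[host] / total, 2), total)
        for host, total in totals.items()
    }


# Made-up records for illustration only, not real openQA data.
sample_jobs = [
    ("openqaworker-arm-1", "passed"),
    ("openqaworker-arm-1", "failed"),
    ("openqaworker-arm-1", "passed"),
    ("openqaworker-arm-1", "passed"),
    ("openqaworker-arm-4", "failed"),
    ("openqaworker-arm-4", "passed"),
    ("openqaworker-arm-4", "none"),
]

for host, (ratio, total) in sorted(fail_ratio_by_host(sample_jobs).items()):
    print(f"{host}: {ratio}% failed of {total} jobs")
```

Reporting the total next to the ratio matters here, because a ratio computed from a handful of jobs is not comparable to one computed from thousands.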


Updated by livdywan about 3 years ago

Status changed from New to Workable


Updated by kraih about 3 years ago

Assignee set to kraih


Updated by kraih about 3 years ago

Status changed from Workable to In Progress


Updated by openqa_review about 3 years ago

Due date set to 2021-11-13

Setting due date based on mean cycle time of SUSE QE Tools


Updated by mkittler about 3 years ago

It looks like the newly added repo was not configured with auto-refresh enabled, which led to errors when updating, see #101779#note-5. I took the liberty of enabling auto-refresh for the repo on arm-4 and arm-5 to fix #101779. This has now actually installed the stable kernel versions (vendor switch from SUSE LLC https://www.suse.com/ to obs://build.opensuse.org/Kernel was done). I assume this was intended. I'll leave it to you to actually let the machines boot into the different kernel. Note that I've also adjusted the repository priority for the kernel repo so there's a clear configuration which packages should take precedence.


Updated by kraih about 3 years ago

mkittler wrote:

This has now actually installed the stable kernel versions (vendor switch from SUSE LLC https://www.suse.com/ to obs://build.opensuse.org/Kernel was done). I assume this was intended. I'll leave it to you to actually let the machines boot into the different kernel. Note that I've also adjusted the repository priority for the kernel repo so there's a clear configuration which packages should take precedence.

The machines were already running the stable kernel. I upgraded them from 5.3.18-59.27-default to 5.14.14-lp153.3.g2b5383f-default on Friday.


Updated by mkittler about 3 years ago

You're right. The vendor change only concerned kernel-firmware-* packages and a few utilities. The kernel itself was not updated again.


Updated by kraih about 3 years ago

Status changed from In Progress to Feedback

Collecting data now.


Updated by kraih about 3 years ago

~~All arm workers seemed pretty busy, so I've not cloned any jobs yet.~~ I didn't know Salt would keep changing the worker class, which makes ad-hoc testing rather annoying. So I'm now cloning jobs.


#10

Updated by mkittler about 3 years ago

So far it looks good, see #101265#note-12 - although the number of jobs which have been executed is still rather small:

openqa=> with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-10-28' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 12.4 |  1008
 openqaworker-arm-2 |                12.79 |  1876
 openqaworker-arm-3 |                12.16 |  1957
 openqaworker-arm-4 |                 7.14 |    14
 openqaworker-arm-5 |                11.11 |    18
(5 rows)
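
To put a number on "rather small", one option is a Wilson score interval around each fail ratio. The sketch below is an illustration only and was not part of the investigation. The failure counts (1 of 14, 2 of 18, 238 of 1957) are an assumption: they are reconstructed by multiplying each ratio by its total and rounding. `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(failed, total, confidence=0.95):
    """Wilson score interval for a binomial proportion, as (low, high)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = failed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# (failed, total) reconstructed from the table above; an assumption.
hosts = {
    "openqaworker-arm-3": (238, 1957),
    "openqaworker-arm-4": (1, 14),
    "openqaworker-arm-5": (2, 18),
}

for host, (failed, total) in hosts.items():
    low, high = wilson_interval(failed, total)
    print(f"{host}: {100 * low:.1f}% to {100 * high:.1f}% ({total} jobs)")
```

With only 14 jobs the interval for arm-4 spans roughly 30 percentage points, so it can neither confirm nor rule out the previously observed 33-36%.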


#11

Updated by kraih about 3 years ago

Cloned a few more jobs randomly, and it seems fine so far. Going to activate them in Salt again and keep an eye on it over the next few days.

        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                12.15 |  1078
 openqaworker-arm-2 |                12.51 |  1982
 openqaworker-arm-3 |                12.04 |  2060
 openqaworker-arm-4 |                 6.38 |    47
 openqaworker-arm-5 |                11.32 |    53


#12

Updated by kraih about 3 years ago

The fact that arm-4/5 are getting far fewer jobs does appear to skew the results a little bit. I assume it's because they don't have the tap class.

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-05' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 2.22 |   135
 openqaworker-arm-2 |                 3.43 |   233
 openqaworker-arm-3 |                  4.4 |   250
 openqaworker-arm-4 |                 12.2 |    41
 openqaworker-arm-5 |                29.63 |    27
(5 rows)


#13

Updated by kraih about 3 years ago

Not sure what to make of the results from the weekend.

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-05' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 7.68 |   456
 openqaworker-arm-2 |                10.35 |   889
 openqaworker-arm-3 |                 8.49 |   931
 openqaworker-arm-4 |                41.67 |   180
 openqaworker-arm-5 |                34.97 |   183
(5 rows)

At first glance it does not look so good, but at the same time arm-4/5 got far fewer jobs than arm-1/2/3. I'm not sure we can get an actually useful comparison without setting up tap on arm-4/5.


#14

Updated by okurz about 3 years ago

I suggest scheduling many more jobs, e.g. 1k-10k. Don't be afraid of the load because the machines aren't used for production right now anyway. Then get the fail ratio from that, even if it covers only non-multi-machine tests, i.e. without "tap". If the fail ratio is below 15%, you can either schedule multi-machine tests or, if you are careful and monitor closely, bring the machines into production and handle all unreviewed test failures quickly so as not to confuse test reviewers.
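
As a rough cross-check of the suggested job count, the textbook normal-approximation formula for the sample size of a proportion can be evaluated for a fail ratio around 35%. This is a sketch under simplifying assumptions (independent jobs, a fixed true fail ratio) and is not how the 1k-10k figure was derived. `required_jobs` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def required_jobs(expected_ratio, margin, confidence=0.95):
    """Jobs needed to estimate a fail ratio to within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z * z * expected_ratio * (1 - expected_ratio) / margin ** 2)


# 0.35 is close to the previously observed 33-36% on arm-4/5.
for margin in (0.05, 0.03, 0.01):
    jobs = required_jobs(0.35, margin)
    print(f"+/- {100 * margin:.0f} percentage points: {jobs} jobs")
```

Under these assumptions, about 1k jobs pin the ratio down to within 3 percentage points, and just under 9k jobs to within 1 point.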


#15

Updated by kraih about 3 years ago

okurz wrote:

I suggest scheduling many more jobs, e.g. 1k-10k. Don't be afraid of the load because the machines aren't used for production right now anyway.

They were in production over the weekend. I'll take them out again now and start a synthetic stress test. Maybe those results will be more helpful.


#16

Updated by kraih about 3 years ago

Results for the latest test so far (still running):

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 4.67 |   150
 openqaworker-arm-2 |                 5.78 |   329
 openqaworker-arm-3 |                 6.45 |   310
 openqaworker-arm-4 |                26.23 |   122
 openqaworker-arm-5 |                22.76 |   123
(5 rows)
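
One way to check whether the gap in this table could be a sampling accident is a pooled two-proportion z-test of arm-4/5 against arm-1/2/3. This sketch is an illustration and was not run as part of the ticket. The failure counts are an assumption, reconstructed from ratio times total and rounded, and the test ignores that the two groups may have received a different mix of jobs.

```python
from math import erfc, sqrt


def two_proportion_z_test(failed_a, total_a, failed_b, total_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = failed_a / total_a
    p_b = failed_b / total_b
    pooled = (failed_a + failed_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # erfc keeps precision for very small tail probabilities.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# (failed, total) summed per group, reconstructed from the table above.
arm_45 = (32 + 28, 122 + 123)
arm_123 = (7 + 19 + 20, 150 + 329 + 310)

z, p_value = two_proportion_z_test(*arm_45, *arm_123)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

A large z value only says that the difference is unlikely to be noise. It says nothing about the cause, so it cannot separate a kernel effect from a hardware or job-mix effect.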


#17

Updated by okurz about 3 years ago

Discussed in the SUSE QE Tools midweekly unblock on 2021-11-10. The numbers from #16 are already convincing. Let's await the results from most or all of the jobs you scheduled. After that we should continue with the other tasks as described in the epic.


#18

Updated by kraih about 3 years ago

Slightly higher fail rate for all arm workers today. (Experiment still ongoing)

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                10.95 |   201
 openqaworker-arm-2 |                14.61 |   479
 openqaworker-arm-3 |                11.82 |   440
 openqaworker-arm-4 |                29.72 |   212
 openqaworker-arm-5 |                23.94 |   213
(5 rows)


#19

Updated by kraih about 3 years ago

And slightly lower again today. (Experiment is still ongoing)

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%-arm-%' and t_finished >= '2021-11-09' group by host;
        host        | ratio_failed_by_host | total 
--------------------+----------------------+-------
 openqaworker-arm-1 |                 6.74 |   341
 openqaworker-arm-2 |                 9.85 |   822
 openqaworker-arm-3 |                 8.36 |   730
 openqaworker-arm-4 |                27.92 |   308
 openqaworker-arm-5 |                23.89 |   314
(5 rows)

We can probably call it after the weekend.
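
For comparison with the baseline from the description (33-36% on arm-4/5 vs. 15-17% on arm-1/2/3, i.e. a gap of roughly 16-21 percentage points), the table above can be pooled per group. This is arithmetic on reconstructed counts (ratio times total, rounded), which is an assumption. The time window and workload also differ from the 30-day Grafana view, so treat it as a rough sketch.

```python
def pooled_ratio(rows):
    """Percent failed over several (failed, total) rows combined."""
    failed = sum(f for f, _ in rows)
    total = sum(t for _, t in rows)
    return 100.0 * failed / total


# (failed, total) per host, reconstructed from the table above.
arm_123 = [(23, 341), (81, 822), (61, 730)]
arm_45 = [(86, 308), (75, 314)]

ratio_123 = pooled_ratio(arm_123)
ratio_45 = pooled_ratio(arm_45)
gap = ratio_45 - ratio_123

print(f"arm-1/2/3: {ratio_123:.2f}% failed")
print(f"arm-4/5:   {ratio_45:.2f}% failed")
print(f"gap:       {gap:.2f} percentage points")
```

Both groups fail less often in this window than in the baseline, but the gap between them stays in the same range as before the kernel change.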


#20

Updated by okurz about 3 years ago

Due date deleted (~~2021-11-13~~)
Status changed from Feedback to Resolved

I consider the results sufficient. The conclusion: "kernel-default" from Kernel:stable behaves the same as the one from openSUSE:Leap:15.3. I updated the epic. Thanks for the work!
