action #34006: [opensuse][functional][u] detect cpu soft lockup on leap 42.3 - openQA Tests (public) - openSUSE Project Management Tool

action #34006

closed

[opensuse][functional][u] detect cpu soft lockup on leap 42.3

Added by riafarov about 7 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

High

Assignee:

jorauch

Category:

Enhancement to existing tests

Target version:

SUSE QA (private) - Milestone 22

Start date:

2018-03-29

Due date:

% Done:

Estimated time:

Difficulty:

Description

Motivation

As part of the investigation of #30085, we found that the cause of those failures is a CPU soft lockup bug in the kernel version shipped with Leap 42.3 (see bsc#1052258).

We tried to improve this (see PR gh#os-autoinst/os-autoinst-distri-opensuse/4704), but the feature implemented in #28027 didn't work for me and the other solution had multiple issues. This ticket is therefore about creating a scalable solution that can be reused for similar scenarios.

Acceptance criteria

AC1: If a CPU soft lockup is detected, a hint about the root cause is logged for the reviewer
AC2: The implementation is not limited to a simple test module but is more scalable

Suggestions

Make the feature implemented in #28027 work for this scenario, and potentially improve it to include a custom message when a pattern is detected

Related issues 2 (0 open — 2 closed)

Updated by riafarov about 7 years ago

Related to action #30085: [functional][u][medium] test fails in updates_packagekit_gpk - no restarting packagekit daemon after libzypp update added

Updated by okurz about 7 years ago

Subject changed from [opensuse][functional] detect cpu soft lockup on leap 42.3 to [opensuse][functional][u] detect cpu soft lockup on leap 42.3
Description updated (diff)
Due date set to 2018-05-22
Status changed from New to Workable
Target version set to Milestone 16

Updated by okurz about 7 years ago

Due date deleted (~~2018-05-22~~)
Target version changed from Milestone 16 to Milestone 19

It's a good idea but we should rather focus on other tasks for the time being.

Updated by okurz almost 7 years ago

Target version changed from Milestone 19 to Milestone 19

Updated by okurz over 6 years ago

Target version changed from Milestone 19 to future

Updated by riafarov over 6 years ago

We have serial detection now, so this would be a one-line change. How about moving it to Milestone 20/21?

Updated by okurz over 6 years ago

Yes, I would like to see this in action, but QSF-u needs to work on other priorities first and the recent re-planning effort has been high, so I would like to keep it in future. Of course, in the ideal case that we finish all tasks in M20/21 sooner, we can still pick it up :)

Updated by jorauch over 6 years ago

Assignee set to jorauch
Target version changed from future to Milestone 21

Taking over

Updated by okurz over 6 years ago

+1 as discussed in the sprint planning meeting today

#10

Updated by jorauch over 6 years ago

Status changed from Workable to In Progress

To me it looks like this has already been done in:
https://github.com/os-autoinst/os-autoinst/pull/932
Wdyt?

#11

Updated by okurz over 6 years ago

Well, the PR you mentioned just adds the feature to be able to add detections but we have to add the actual check patterns. Please see the example https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4704 mentioned in the description as well.

#12

Updated by jorauch over 6 years ago

This PR:
https://github.com/os-autoinst/os-autoinst/pull/977
contains the documentation on how to use the serial detection feature:
$testapi::distri->set_expected_serial_failures(
    soft => {"AWESOME SOFT MSG 1" => [qr/gcc version/], "AWESOME SOFT MSG 2" => [qr/insmod error: 1/]},
    hard => {"AWESOME HARD MSG" => [qr/No iBFT detected/]},
);
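
To illustrate how a soft/hard table of this shape could be evaluated against serial output, here is a minimal sketch in plain Perl. It does not use os-autoinst at all: the helper `classify_serial_output`, the sample serial text, and the soft-lockup regex are assumptions made for illustration, not the code or the pattern that was eventually merged. Filing the lockup under "soft" is likewise just a choice for the example.

```perl
use strict;
use warnings;

# Table in the same shape as the documented call: message => [patterns].
# The bsc#1052258 entry and its regex are illustrative assumptions.
my %expected = (
    soft => {'bsc#1052258 - cpu soft lockup' => [qr/BUG: soft lockup - CPU#\d+ stuck/]},
    hard => {'AWESOME HARD MSG'              => [qr/No iBFT detected/]},
);

# Hypothetical helper: return [severity, message] pairs for every table
# entry with at least one pattern matching the serial output.
sub classify_serial_output {
    my ($serial_output, $table) = @_;
    my @hits;
    for my $severity (sort keys %$table) {
        for my $message (sort keys %{$table->{$severity}}) {
            my $patterns = $table->{$severity}{$message};
            push @hits, [$severity, $message]
              if grep { $serial_output =~ $_ } @$patterns;
        }
    }
    return @hits;
}

# Made-up serial output for demonstration purposes.
my $serial = "Welcome to openSUSE Leap 42.3\n"
  . "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:33]\n";

printf "%s: %s\n", @$_ for classify_serial_output($serial, \%expected);
```

Keeping the message as the hash key means the hint shown to the reviewer (AC1) travels together with the pattern that triggered it.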

#13

Updated by jorauch over 6 years ago

We should add an extra module for checking the serial log for all known failures

#14

Updated by jorauch over 6 years ago

We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures, so I will now try to generalize that code and execute it in the post_fail_hook and, in the best case, before the end of every test suite.

The following elements should be considered:

A hash with bug references and a pattern to recognize each of them
A function that checks the journal and records the bugs, called in the general post_fail_hook
set_expected_failures to recognize stopper bugs on worker level
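
The first two elements could look roughly like the sketch below. This is plain, self-contained Perl under stated assumptions: `%known_bugs`, `find_known_bugs`, the regex, and the sample journal lines are all hypothetical names and data chosen for illustration. Only the bug reference bsc#1052258 comes from the ticket description.

```perl
use strict;
use warnings;

# Hypothetical "bug table": bug reference => pattern that identifies it.
# The regex is an assumption about how the kernel reports a soft lockup.
my %known_bugs = (
    'bsc#1052258' => qr/BUG: soft lockup - CPU#\d+ stuck for \d+s/,
);

# Hypothetical helper: scan log text line by line and return one record
# per matching line, so that a post_fail_hook could report each of them.
sub find_known_bugs {
    my ($log_text, $bugs) = @_;
    my @records;
    for my $line (split /\n/, $log_text) {
        for my $ref (sort keys %$bugs) {
            push @records, {bug => $ref, line => $line}
              if $line =~ $bugs->{$ref};
        }
    }
    return \@records;
}

# Made-up journal excerpt for demonstration purposes.
my $journal = join "\n",
  'systemd[1]: Started Session 1 of user root.',
  'kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]';

print "$_->{bug}: $_->{line}\n" for @{find_known_bugs($journal, \%known_bugs)};
```

In a real test module the returned records would presumably be handed to something like `record_soft_failure` rather than printed; that wiring is left out here on purpose.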

#15

Updated by okurz over 6 years ago

jorauch wrote:

We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures

Don't confuse "checking the journal from within the SUT" with "checking for errors reported on the serial terminal, as done by the worker". These are two different things, even though the journal and the serial terminal might include the same or related messages.

#16

Updated by jorauch over 6 years ago

Agreed with okurz to:

put the creation of the 'bug table' in a separate file
keep the detection on the worker
check in boot_to_desktop for a welcome message to verify how the detection works

#17

Updated by jorauch over 6 years ago

The latest implementation of the needed feature:
https://github.com/os-autoinst/os-autoinst/pull/998

#18

Updated by jorauch over 6 years ago

Created WIP PR:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6398

#19

Updated by jorauch over 6 years ago

Finally working:
http://pinky.arch.suse.de/tests/1685#step/boot_to_desktop/4

#20

Updated by mgriessmeier over 6 years ago

Status changed from In Progress to Feedback

waiting for PR to be merged

#21

Updated by okurz over 6 years ago

PR merged

#22

Updated by okurz over 6 years ago

The merge broke a lot of tests. A fix has been applied and about 1k jobs were retriggered, so it should be fine now.

#23

Updated by okurz over 6 years ago

Priority changed from Normal to High
Target version changed from Milestone 21 to Milestone 22

median cycle time exceeded -> bumping prio and target version to current milestone

#24

Updated by jorauch over 6 years ago

Since we have no more outages caused by this, we should be able to close the ticket?

#25

Updated by okurz over 6 years ago

Sure, if both ACs are covered and we have a verification run on production

#26

Updated by jorauch over 6 years ago

Status changed from Feedback to Resolved

zluo has seen this in the wild, so I am closing the ticket now

#27

Updated by okurz over 6 years ago

Status changed from Resolved to Feedback

I am still not sure we are talking about the same thing. So we have never seen the issue being detected and marked as such? It's not about seeing that jobs fail due to "cpu soft lockup"; I guess zluo got this confused. It's about the tests detecting the situation themselves. Can we have a verification for that?

#28

Updated by jorauch over 6 years ago

We can wait until we see this by luck, or until some really new and cool bug goes through enough builds that we can use the new mechanism to detect it in production.
But imho that's a waste of time, because it would be completely random and we cannot even estimate when it will happen.
I think a normal verification run must be sufficient here.

#29

Updated by szarate over 6 years ago

I agree with Jojo here... However, there's always the trick of turning off the host's thread or CPU while the machine is running, if such a thing is possible :P

#30

Updated by mgriessmeier over 6 years ago

Status changed from Feedback to Resolved

as agreed in review -> resolving

#31

Updated by okurz over 6 years ago

Related to action #45530: [aarch64] system_workarounds.pm triggers lib/known_bugs serial detection which abort whole test suite added
