action #34006

closed

[opensuse][functional][u] detect cpu soft lockup on leap 42.3

Added by riafarov about 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Enhancement to existing tests
Target version: SUSE QA - Milestone 22
Start date: 2018-03-29
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Motivation

As part of the investigation of #30085, we found out that the cause of those failures is a CPU soft lockup bug in the kernel version shipped with Leap 42.3 (see bsc#1052258).

We tried to improve this, see PR gh#os-autoinst/os-autoinst-distri-opensuse/4704, but the feature implemented in #28027 didn't work for me and the other solution had multiple issues. Hence this ticket: create a scalable solution which can be reused for similar scenarios.

Acceptance criteria

  • AC1: If a CPU soft lockup is detected, a hint about the root cause is logged for the reviewer
  • AC2: The implementation is not limited to a simple test module but is more scalable

Suggestions

Make the feature implemented in #28027 work for this scenario, and potentially improve it to include a custom message when a pattern is detected.


Related issues: 2 (0 open, 2 closed)

Related to openQA Tests - action #30085: [functional][u][medium] test fails in updates_packagekit_gpk - no restarting packagekit daemon after libzypp update | Resolved | riafarov | 2018-01-09 | 2018-04-10

Related to openQA Tests - action #45530: [aarch64] system_workarounds.pm triggers lib/known_bugs serial detection which abort whole test suite | Closed | ggardet_arm | 2018-12-24

Actions #1

Updated by riafarov about 6 years ago

  • Related to action #30085: [functional][u][medium] test fails in updates_packagekit_gpk - no restarting packagekit daemon after libzypp update added
Actions #2

Updated by okurz about 6 years ago

  • Subject changed from [opensuse][functional] detect cpu soft lockup on leap 42.3 to [opensuse][functional][u] detect cpu soft lockup on leap 42.3
  • Description updated (diff)
  • Due date set to 2018-05-22
  • Status changed from New to Workable
  • Target version set to Milestone 16
Actions #3

Updated by okurz almost 6 years ago

  • Due date deleted (2018-05-22)
  • Target version changed from Milestone 16 to Milestone 19

It's a good idea but we should rather focus on other tasks for the time being.

Actions #4

Updated by okurz almost 6 years ago

  • Target version changed from Milestone 19 to Milestone 19
Actions #5

Updated by okurz over 5 years ago

  • Target version changed from Milestone 19 to future
Actions #6

Updated by riafarov over 5 years ago

We have serial detection now, so it would be a one-line change. What about moving it to Milestone 20/21?

Actions #7

Updated by okurz over 5 years ago

Yes, I would like to see this in action, but QSF-u needs to work on other priorities first and the recent re-planning effort is high, so I would like to keep it in "future". Of course, in the ideal case that we are done with all tasks in M20/21 sooner, we can still pick it up :)

Actions #8

Updated by jorauch over 5 years ago

  • Assignee set to jorauch
  • Target version changed from future to Milestone 21

Taking over

Actions #9

Updated by okurz over 5 years ago

+1 as discussed in the sprint planning meeting today

Actions #10

Updated by jorauch over 5 years ago

  • Status changed from Workable to In Progress

To me it looks like this has already been done in:
https://github.com/os-autoinst/os-autoinst/pull/932
Wdyt?

Actions #11

Updated by okurz over 5 years ago

Well, the PR you mentioned just adds the feature to be able to add detections but we have to add the actual check patterns. Please see the example https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4704 mentioned in the description as well.

Actions #12

Updated by jorauch over 5 years ago

This PR:
https://github.com/os-autoinst/os-autoinst/pull/977
documents how to use the serial detection feature:
$testapi::distri->set_expected_serial_failures(soft=>{"AWESOME SOFT MSG 1"=>[qr/gcc version/], "AWESOME SOFT MSG 2"=>[qr/insmod error: 1/]}, hard=>{"AWESOME HARD MSG"=>[qr/No iBFT detected/]});
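
To illustrate the shape of that call, here is a minimal sketch in Python (not the Perl used by os-autoinst, and not its actual implementation). It assumes each message maps to a list of regular expressions and that a message is reported as soon as any of its patterns matches the serial output. The function name check_serial and the sample serial text are made up for this example; how "soft" and "hard" findings are then handled is up to the caller.

```python
import re


def check_serial(text, soft, hard):
    """Return (severity, message) for every message with a matching pattern.

    `soft` and `hard` map a message to a list of regular expressions,
    mirroring the structure of the Perl call quoted above.
    """
    findings = []
    for severity, table in (("soft", soft), ("hard", hard)):
        for message, patterns in table.items():
            if any(re.search(pattern, text) for pattern in patterns):
                findings.append((severity, message))
    return findings


# Same messages and patterns as in the quoted call.
soft = {
    "AWESOME SOFT MSG 1": [r"gcc version"],
    "AWESOME SOFT MSG 2": [r"insmod error: 1"],
}
hard = {"AWESOME HARD MSG": [r"No iBFT detected"]}

# Made-up serial output, only for demonstration.
serial_log = "Linux version 4.4.76 (gcc version 4.8.5)\nNo iBFT detected.\n"
findings = check_serial(serial_log, soft, hard)
```

With the sample text above, one soft and one hard message are found; the second soft pattern does not occur and is therefore not reported.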

Actions #13

Updated by jorauch over 5 years ago

We should add an extra module for checking the serial log for all known failures

Actions #14

Updated by jorauch over 5 years ago

We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures, so I will now try to generalize that code and execute it in the post_fail_hook and, in the best case, before the end of every test suite.

The following elements should be considered:

  • A hash with bug references and a pattern to recognize each of them
  • A function that checks the journal and records the bugs, called in the general post_fail_hook
  • set_expected_failures to recognize stopper bugs on worker level
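
The first two elements could look roughly like the following Python sketch (again not the project's Perl code). The names KnownBug, KNOWN_BUGS and record_known_bugs are hypothetical, the regular expression for the soft lockup is an assumption based on the usual kernel message wording, and the sample log lines are invented; only the bug reference bsc#1052258 comes from this ticket.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownBug:
    reference: str  # bug reference shown to the reviewer
    pattern: str    # regular expression matched against each log line
    hint: str       # short explanation of the root cause


# Hypothetical bug table; the pattern is an assumption, not taken from the ticket.
KNOWN_BUGS = [
    KnownBug(
        reference="bsc#1052258",
        pattern=r"BUG: soft lockup - CPU#\d+ stuck for \d+s",
        hint="CPU soft lockup detected",
    ),
]


def record_known_bugs(log_lines, bugs=KNOWN_BUGS):
    """Return one hint per known bug whose pattern occurs in the log."""
    recorded = []
    for bug in bugs:
        for number, line in enumerate(log_lines, start=1):
            if re.search(bug.pattern, line):
                recorded.append(f"{bug.reference}: {bug.hint} (line {number})")
                break  # one hint per bug is enough for the reviewer
    return recorded


# Invented log excerpt, only for demonstration.
sample = [
    "[   12.3] systemd[1]: Started Session 1.",
    "[  340.1] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/1:0]",
]
hints = record_known_bugs(sample)
```

Keeping the table separate from the scanning function means a new known bug only needs a new table entry, which is the kind of reuse AC2 asks for.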
Actions #15

Updated by okurz over 5 years ago

jorauch wrote:

We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures

Don't confuse "checking the journal from within the SUT" with "checking for errors reported on the serial terminal, done by the worker". These are two different things, even though the journal and the serial terminal might include the same or related messages.

Actions #16

Updated by jorauch over 5 years ago

Agreed with okurz to:

  • put the creation of the 'bug table' in a separate file
  • keep the detection on worker
  • Check in boot_to_desktop for a welcome message to verify how the detection works
Actions #17

Updated by jorauch over 5 years ago

The latest implementation of the needed feature:
https://github.com/os-autoinst/os-autoinst/pull/998

Actions #20

Updated by mgriessmeier over 5 years ago

  • Status changed from In Progress to Feedback

waiting for PR to be merged

Actions #21

Updated by okurz over 5 years ago

PR merged

Actions #22

Updated by okurz over 5 years ago

broke a lot of tests, fix applied, retriggered like 1k jobs, should be fine now.

Actions #23

Updated by okurz over 5 years ago

  • Priority changed from Normal to High
  • Target version changed from Milestone 21 to Milestone 22

median cycle time exceeded -> bumping prio and target version to current milestone

Actions #24

Updated by jorauch over 5 years ago

Since we have no more outages caused by this, we should be able to close the ticket?

Actions #25

Updated by okurz over 5 years ago

Sure, if both ACs are covered and we have a verification run on production

Actions #26

Updated by jorauch over 5 years ago

  • Status changed from Feedback to Resolved

zluo has seen this in the wild, so I am closing the ticket now

Actions #27

Updated by okurz over 5 years ago

  • Status changed from Resolved to Feedback

I am still not sure we are talking about the same thing. So we have never seen the issue being detected and marked as such? It's not about seeing that jobs fail due to "cpu soft lockup"; I guess zluo got this confused. It's about the tests detecting the situation themselves. Can we have a verification for that?

Actions #28

Updated by jorauch over 5 years ago

We can wait until we see this by luck or we have some really new and cool bug that goes through enough builds so we can use the new mechanics to detect it in production.
But imho that's a waste of time because that would be completely random and we cannot even estimate when it will happen.
I think a normal verification run must be sufficient here

Actions #29

Updated by szarate over 5 years ago

I agree with Jojo here... However, there's always the trick of turning off the host's thread or CPU while the machine is running, if such a thing is possible :P

Actions #30

Updated by mgriessmeier over 5 years ago

  • Status changed from Feedback to Resolved

as agreed in review -> resolving

Actions #31

Updated by okurz over 5 years ago

  • Related to action #45530: [aarch64] system_workarounds.pm triggers lib/known_bugs serial detection which abort whole test suite added