action #34006
closed [opensuse][functional][u] detect cpu soft lockup on leap 42.3
0%
Description
Motivation
While investigating #30085, we found that those failures are caused by a CPU soft lockup bug in the kernel version shipped with Leap 42.3 (see bsc#1052258).
We tried to improve the situation in PR gh#os-autoinst/os-autoinst-distri-opensuse/4704, but the feature implemented in #28027 didn't work for me and the other solution had multiple issues. Hence this ticket: create a scalable solution that can be reused for similar scenarios.
Acceptance criteria
- AC1: If a CPU soft lockup is detected, a hint about the root cause is logged for the reviewer
- AC2: The implementation is not limited to a simple test module but is more scalable
Suggestions
Make the feature implemented in #28027 work for this scenario, and potentially extend it to log a custom message when a pattern is detected
Updated by riafarov over 6 years ago
- Related to action #30085: [functional][u][medium] test fails in updates_packagekit_gpk - no restarting packagekit daemon after libzypp update added
Updated by okurz over 6 years ago
- Subject changed from [opensuse][functional] detect cpu soft lockup on leap 42.3 to [opensuse][functional][u] detect cpu soft lockup on leap 42.3
- Description updated (diff)
- Due date set to 2018-05-22
- Status changed from New to Workable
- Target version set to Milestone 16
Updated by okurz over 6 years ago
- Due date deleted (2018-05-22)
- Target version changed from Milestone 16 to Milestone 19
It's a good idea but we should rather focus on other tasks for the time being.
Updated by okurz over 6 years ago
- Target version changed from Milestone 19 to Milestone 19
Updated by okurz about 6 years ago
- Target version changed from Milestone 19 to future
Updated by riafarov about 6 years ago
We have serial detection now, so this would be a one-line change. What about moving it to Milestone 20/21?
Updated by okurz about 6 years ago
Yes, I would like to see this in action but QSF-u needs to work on other priorities first and the re-planning effort recently is high so I would like to keep it in future. Of course, in the idealistic case that we would be done with all tasks in M20/21 sooner we can still pick it up :)
Updated by jorauch almost 6 years ago
- Assignee set to jorauch
- Target version changed from future to Milestone 21
Taking over
Updated by okurz almost 6 years ago
+1 as discussed in the sprint planning meeting today
Updated by jorauch almost 6 years ago
- Status changed from Workable to In Progress
To me it looks like this has already been done in:
https://github.com/os-autoinst/os-autoinst/pull/932
Wdyt?
Updated by okurz almost 6 years ago
Well, the PR you mentioned just adds the feature to be able to add detections but we have to add the actual check patterns. Please see the example https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4704 mentioned in the description as well.
Updated by jorauch almost 6 years ago
This PR:
https://github.com/os-autoinst/os-autoinst/pull/977
contains the documentation on how to use the serial detection feature:
$testapi::distri->set_expected_serial_failures(soft=>{"AWESOME SOFT MSG 1"=>[qr/gcc version/], "AWESOME SOFT MSG 2"=>[qr/insmod error: 1/]}, hard=>{"AWESOME HARD MSG"=>[qr/No iBFT detected/]});
Updated by jorauch almost 6 years ago
We should add an extra module for checking the serial log for all known failures
Updated by jorauch almost 6 years ago
We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures, so I will now try to generalize that code and execute it in the post_fail_hook and, in the best case, before the end of every test suite
The following elements should be considered:
- A hash with bug references and a pattern to recognize each of them
- A function that checks the journal and records the bugs, called in the general post_fail_hook
- set_expected_failures to recognize stopper bugs on worker level
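The first two elements could be sketched roughly as below. This is a standalone illustration, not the code that was eventually merged: the names `@known_bugs` and `find_known_bugs` are hypothetical, the `hard` classification is an assumption, and the regex is an assumption based on the usual kernel watchdog message. Only the bug reference bsc#1052258 comes from this ticket.

```perl
use strict;
use warnings;

# Hypothetical "bug table": each entry pairs a reviewer hint with a pattern.
# The regex is an assumption based on the usual kernel watchdog message.
my @known_bugs = (
    {
        type    => 'hard',    # assumed severity, not confirmed in this ticket
        message => 'CPU soft lockup detected, see bsc#1052258',
        pattern => qr/BUG: soft lockup - CPU#\d+ stuck for \d+s/,
    },
);

# Return every table entry whose pattern matches the given log text.
sub find_known_bugs {
    my ($log_text, @table) = @_;
    return grep { $log_text =~ $_->{pattern} } @table;
}

# Sample log line, made up for illustration.
my $sample = "[  123.456789] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:42]\n";
for my $hit (find_known_bugs($sample, @known_bugs)) {
    print "$hit->{type}: $hit->{message}\n";
}
```

Keeping the table as plain data separate from the matching function means a new known bug only needs a new entry, which is in the spirit of AC2.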
Updated by okurz almost 6 years ago
jorauch wrote:
We have https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/caasp/journal_check.pm
to check for serial failures
Don't confuse "checking the journal from within the SUT" with "checking for errors reported on the serial terminal by the worker". These are two different things, even though the journal and the serial terminal might include the same or related messages.
Updated by jorauch almost 6 years ago
Agreed with okurz to:
- put the creation of the 'bug table' in a separate file
- keep the detection on worker
- Check in boot_to_desktop for a welcome message to verify how the detection works
Updated by jorauch almost 6 years ago
The latest implementation of the needed feature:
https://github.com/os-autoinst/os-autoinst/pull/998
Updated by jorauch almost 6 years ago
Finally working:
http://pinky.arch.suse.de/tests/1685#step/boot_to_desktop/4
Updated by mgriessmeier almost 6 years ago
- Status changed from In Progress to Feedback
waiting for PR to be merged
Updated by okurz almost 6 years ago
broke a lot of tests, fix applied, retriggered like 1k jobs, should be fine now.
Updated by okurz almost 6 years ago
- Priority changed from Normal to High
- Target version changed from Milestone 21 to Milestone 22
median cycle time exceeded -> bumping prio and target version to current milestone
Updated by jorauch almost 6 years ago
Since we have no more outages caused by this, we should be able to close the ticket?
Updated by okurz almost 6 years ago
Sure, if both ACs are covered and we have a verification run on production
Updated by jorauch almost 6 years ago
- Status changed from Feedback to Resolved
zluo has seen this in the wild, so I am closing the ticket now
Updated by okurz almost 6 years ago
- Status changed from Resolved to Feedback
I am still not sure we are talking about the same thing. So we have never seen the issue being detected and marked as such? It's not about seeing that jobs fail due to "cpu soft lockup", I guess zluo got this confused. It's about the tests detecting the situation themselves. Can we have a verification for that?
Updated by jorauch almost 6 years ago
We can wait until we see this by luck or we have some really new and cool bug that goes through enough builds so we can use the new mechanics to detect it in production.
But imho that's a waste of time because that would be completely random and we cannot even estimate when it will happen.
I think a normal verification run must be sufficient here
Updated by szarate almost 6 years ago
I agree with Jojo here... However, there's always the trick of turning off the host's thread or CPU while the machine is running, if such a thing is possible :P
Updated by mgriessmeier almost 6 years ago
- Status changed from Feedback to Resolved
as agreed in review -> resolving
Updated by okurz almost 6 years ago
- Related to action #45530: [aarch64] system_workarounds.pm triggers lib/known_bugs serial detection which abort whole test suite added