action #10584
closed[s390] sporadic fail in bootloader_s390, e.g. downloading kernel and initrd
100%
Description
observation¶
tests are still failing sometimes in bootloader_s390 due to unexpected behaviour when downloading kernel and initrd...
steps to reproduce¶
- "clone_job.pl" on any s390x@zVM job often enough, at least 20 times, to see the issue
problem¶
H1. REJECTED: too many jobs started at once cause some timeout on download --> random back-off time in qaboot did not remedy all problems, reverted changes
H2. ACCEPTED: Network download can stall some times and we can't control this --> ~~undecided if this is true, still ~~ "retry" as workaround applied, see #10584#note-18
H3. REJECTED: Network download is just slower sometimes, causing progress notification to be output more often, causing one buffer window of x3270 to fill up and the function expect_s3270
in the backend is not handling this correctly. --> expect_s3270 was behaving correctly just having wrong comments and misleading output, see #10584#note-21
suggestion¶
some tasks have been tried, workarounds applied, what is left to do:- DONE: Fix handling of "status line matches neither buffer_ready nor buffer_full:" in
expect_s3270
to accept either H3 or H2 and reject the other -> expect_s3270 was actually fine but the message "status line matches neither buffer_ready nor buffer_full" was very confusing and improved in gh-osautoinst#514
further details¶
the former ftpboot fix (#10224) seems to be not the only issue with that
so i will investigate this further
Updated by mgriessmeier over 8 years ago
- Copied from action #10224: [s390] modify ftpboot to prevent tests from failing sometimes when multiple tests are started at once added
Updated by okurz over 8 years ago
- Subject changed from [s390] tests are still failing sometimes in bootloader_s390 due to unexpected behaviour when downloading kernel and initrd... to [s390] sporadic fail in bootloader_s390 downloading kernel and initrd
- Description updated (diff)
Updated by mgriessmeier about 8 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Issue was not seen for a long time after the first workaround, so the assumption is that this was a performance issue on our mainframe
-> closing for now
Updated by mgriessmeier about 8 years ago
- Status changed from Resolved to In Progress
reopened due to recent fails...
--> adjusted the random wait time for establishing a ftp connection in ftpboot to 90seconds
Updated by okurz about 8 years ago
- Has duplicate action #11910: s390 bootloader fails in https://openqa.suse.de/tests/383606 added
Updated by okurz almost 8 years ago
recent example: https://openqa.suse.de/tests/396484
where can we mangle with the ftpboot stuff? Or can we just retry from the backend or tests?
Updated by okurz almost 8 years ago
https://openqa.suse.de/tests/400294/modules/bootloader_s390/steps/4 looks again like timeout in lsdasd but the output is complete
Updated by okurz almost 8 years ago
- Related to action #12106: s390 sporadic hang in bootloader_s390 on lsdasd added
Updated by okurz almost 8 years ago
- Copied to action #12182: [s390] Unknown CP command: QABOOT added
Updated by okurz almost 8 years ago
I could reproduce it manually by accident, see here:
https://w3.suse.de/~okurz/recording_s390_manual_install-2016-06-02_manual_zvm_Build1579.mkv
Updated by okurz almost 8 years ago
This should help: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1445
After the PR is merged, we should remove the waiting time in "qaboot" and gather feedback from test runs.
Updated by okurz almost 8 years ago
ok, problem is that the following commit 7a2e1b415ca7743174836b4e0894e0dae3fc923f is the first bad commit
commit 7a2e1b415ca7743174836b4e0894e0dae3fc923f
Author: Vladimir Nadvornik <nadvornik@suse.cz>
Date: Wed May 18 09:40:37 2016 +0200
Replace threads with processes
use of perl threads is officially discouraged
this fixes https://progress.opensuse.org/issues/11816
:040000 040000 7e028197d2c68e11f7c9a8dca6c0e8b057527596 35826c8c7beb4b04ab13d1cf73f40a4aaf753e3c M backend
:100755 100755 12d7e4e5acb0b7cde9aebc9ad09216b93d0500f8 987eeae91b5f837231a360b7f9cb7bbe5d7a1114 M bmwqemu.pm
:100644 100644 e868ca4569da37914d7278feae9fe91c7bb8b816 4b73ffacdf64e7d032708b400bb96dc02783c92f M commands.pm
:100644 100644 f6f9c96520bb7c61b5a9f2b3b278fb52359e6f9b c5ab8fad9aa626c609a90f040fcd377dafab1b3f M cpanfile
:100755 100755 18024a2a366aa992437da4152c8eaecb69405214 e5622ab375ae3726b4ddd84391fbb614e4867b31 M isotovideo
:100755 100755 1c1bb217d08cab435390ad9b15f531912281bce1 ef4a3900187567da7649b166513d93faf9a78817 M testapi.pm
It is breaking the use of "confess" within consoles/s3270.pm
https://github.com/os-autoinst/os-autoinst/pull/510 and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1447 should fix it however the real reason why "confess" fails is unknown.
Updated by coolo almost 8 years ago
But you do realize confess is meant to be fatal, right?
Updated by okurz almost 8 years ago
yes, I know. To be more exact: "confess" dies "within itself" so not as it should, with a backtrace and then die but I guess within some kind of overwritten custom die handler or something like that.
Updated by nadvornik almost 8 years ago
Anyway, the intended behavior is that "confess" should call "die_handler" in baseclass.pm, the handler should shut down the backend correctly.
Updated by mgriessmeier almost 8 years ago
still sporadicly failing - retry seems not to work properly
see e.g. https://openqa.suse.de/tests/439591 or https://openqa.suse.de/tests/439577
Updated by okurz almost 8 years ago
- Related to action #12276: [s390] can timeout waiting for "CMS" added
Updated by okurz almost 8 years ago
- Description updated (diff)
As described in action #12276 a workaround in the tests has been applied to retry on errors on either download errors or s3270 related errors: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1463
"QABOOT" has been simplified by removing the "random sleep time" to prevent download issues. The assumption for the sleep was that too many jobs started at the same time could cause some timeout issue. Regardless if this is true or not the issue is not fixed by that. That is why the retry is the better behavior.
It seems another issue (or maybe the real problem?) why we did not see any backtrace on "die" in the logfile is that the die-handler catches the exception but silently forwards it to the web UI showing the "wait_serial"-like failure windows including all details and the backtrace. https://github.com/os-autoinst/os-autoinst/pull/513 prevents this confusion by outputting any message, e.g. backtrace, to the logfile, first. Grepping logfiles of multiple jobs is also much easier than looking at the web UI.
Now no sporadic timeout on "#cp … cms" or downloading kernel+initrd should make the tests fail. However, the backend implementation of the s3270 console is still lacking. It could be we introduced a regression or it was not completely stable after all. mgriessmeier proposes to just replace "expect_s3270" calls by "assert_screen" on the x3270 window. okurz proposes to improve the backend console properly (needs better understanding how it really works by okurz+mgriessmeier).
Updated by okurz almost 8 years ago
@mgriessmeier: Take a look at this failure, seems slightly different but new: https://openqa.suse.de/tests/442807/file/autoinst-log.txt
It should from now on also be catched by the retry-eval but nevertheless we might want to understand this better.
Updated by okurz almost 8 years ago
- Description updated (diff)
The output for the previous test is
'4145152 bytes transferred. ',
'INPUT ',
'4153344 bytes transferred. ',
'4161536 bytes transferred. ',
'Server not responding, closing connection. ',
'Abnormal inter-VM communication condition ',
'Abnormal inter-VM communication condition ',
'>>>ABOR
so the download aborts with a new warning from probably the FTP call but this should be catched now from our high-level retry in the test.
https://openqa.suse.de/tests/443270 is failing in the "Loading installation system" step.
Seen in both is that the line "INPUT" is typed too early before the xedit mode was encountered. Then the content that should go into the file edited is instead typed too early and lost. That is why the next step "Loading installation system" is never reached.
EDIT: This error was introduced with os-autoinst#d7b93259, fixed with next PR
Updated by okurz almost 8 years ago
- Subject changed from [s390] sporadic fail in bootloader_s390 downloading kernel and initrd to [s390] sporadic fail in bootloader_s390, e.g. downloading kernel and initrd
- Description updated (diff)
After a complete code walkthrough with mgriessmeier, PR done with mgriessmeier: https://github.com/os-autoinst/os-autoinst/pull/514
The most important change is basically:
- warn "status line matches neither buffer_ready nor buffer_full:\n" . Dumper($result) . $status_line;
+ confess "expect_3270: timed out waiting for 'expected_status'.\n" . " waiting for ${\Dumper \%arg}\n" . " last output:\n" . Dumper($result) . $status_line;
i.e. a fatal error as in before which can be catched from the test code with "eval". Previously we exchanged an "confess" by "warn" on the same location as we had the feeling that this line is part of expected behaviour which was right considering that this point is reached when test code is looking for certain text in the status line which does not show up in the specified timeout. We were wrong to change it to a non-fatal warn as then this situation could not be detected anymore from the test code and therefore not properly handled. The confess message was probably implemented a long time ago and not improved when the functionality to also search for custom status line entries was introduced, i.e. not only "RUNNING" or "MORE..."
The correct behaviour from the test point of view is as implemented with the latest change in the test, i.e. retry on network errors.
Therefore rejecting H3 and accepting H2.
Updated by okurz almost 8 years ago
- Status changed from In Progress to Feedback
new backend is deployed, let's wait some more tests to see this does not come back. Also review subtasks before closing
EDIT: New backend was not deployed on all workers until today (2016-06-20), it seems. It should work now.