Project

General

Profile

Actions

action #10584

closed

[s390] sporadic fail in bootloader_s390, e.g. downloading kernel and initrd

Added by mgriessmeier over 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2016-01-15
Due date:
% Done:

100%

Estimated time:

Description

observation

tests are still failing sometimes in bootloader_s390 due to unexpected behaviour when downloading kernel and initrd...

steps to reproduce

  • "clone_job.pl" on any s390x@zVM job often enough, at least 20 times, to see the issue

problem

H1. REJECTED: too many jobs started at once cause some timeout on download --> random back-off time in qaboot did not remedy all problems, reverted changes
H2. ACCEPTED: Network download can stall some times and we can't control this --> ~~undecided if this is true, still ~~ "retry" as workaround applied, see #10584#note-18
H3. REJECTED: Network download is just slower sometimes, causing progress notification to be output more often, causing one buffer window of x3270 to fill up and the function expect_s3270 in the backend is not handling this correctly. --> expect_s3270 was behaving correctly just having wrong comments and misleading output, see #10584#note-21

suggestion

  • some tasks have been tried, workarounds applied, what is left to do:
  • DONE: Fix handling of "status line matches neither buffer_ready nor buffer_full:" in expect_s3270 to accept either H3 or H2 and reject the other -> expect_s3270 was actually fine but the message "status line matches neither buffer_ready nor buffer_full" was very confusing and improved in gh-osautoinst#514

further details

the former ftpboot fix (#10224) seems to be not the only issue with that
so i will investigate this further


Related issues 5 (0 open5 closed)

Related to openQA Tests - action #12106: s390 sporadic hang in bootloader_s390 on lsdasdResolvedokurz2016-05-24

Actions
Related to openQA Tests - action #12276: [s390] can timeout waiting for "CMS"Resolvedokurz2016-06-09

Actions
Has duplicate openQA Tests - action #11910: s390 bootloader fails in https://openqa.suse.de/tests/383606Closedokurz2016-05-10

Actions
Copied from openQA Project - action #10224: [s390] modify ftpboot to prevent tests from failing sometimes when multiple tests are started at onceResolvedmgriessmeier2016-01-152016-01-15

Actions
Copied to openQA Tests - action #12182: [s390] Unknown CP command: QABOOTResolved2016-01-15

Actions
Actions #1

Updated by mgriessmeier over 8 years ago

  • Copied from action #10224: [s390] modify ftpboot to prevent tests from failing sometimes when multiple tests are started at once added
Actions #2

Updated by okurz over 8 years ago

  • Subject changed from [s390] tests are still failing sometimes in bootloader_s390 due to unexpected behaviour when downloading kernel and initrd... to [s390] sporadic fail in bootloader_s390 downloading kernel and initrd
  • Description updated (diff)
Actions #3

Updated by mgriessmeier about 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Issue was not seen for a long time after the first workaround, so the assumption is that this was a performance issue on our mainframe
-> closing for now

Actions #4

Updated by mgriessmeier about 8 years ago

  • Status changed from Resolved to In Progress

reopened due to recent fails...

--> adjusted the random wait time for establishing a ftp connection in ftpboot to 90seconds

Actions #5

Updated by okurz about 8 years ago

  • Has duplicate action #11910: s390 bootloader fails in https://openqa.suse.de/tests/383606 added
Actions #6

Updated by okurz almost 8 years ago

recent example: https://openqa.suse.de/tests/396484

where can we mangle with the ftpboot stuff? Or can we just retry from the backend or tests?

Actions #7

Updated by okurz almost 8 years ago

https://openqa.suse.de/tests/400294/modules/bootloader_s390/steps/4 looks again like timeout in lsdasd but the output is complete

Actions #8

Updated by okurz almost 8 years ago

  • Related to action #12106: s390 sporadic hang in bootloader_s390 on lsdasd added
Actions #9

Updated by okurz almost 8 years ago

  • Copied to action #12182: [s390] Unknown CP command: QABOOT added
Actions #11

Updated by okurz almost 8 years ago

This should help: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1445

After the PR is merged, we should remove the waiting time in "qaboot" and gather feedback from test runs.

Actions #12

Updated by okurz almost 8 years ago

ok, problem is that the following commit 7a2e1b415ca7743174836b4e0894e0dae3fc923f is the first bad commit

commit 7a2e1b415ca7743174836b4e0894e0dae3fc923f
Author: Vladimir Nadvornik <nadvornik@suse.cz>
Date:   Wed May 18 09:40:37 2016 +0200

    Replace threads with processes

    use of perl threads is officially discouraged
    this fixes https://progress.opensuse.org/issues/11816

:040000 040000 7e028197d2c68e11f7c9a8dca6c0e8b057527596 35826c8c7beb4b04ab13d1cf73f40a4aaf753e3c M      backend
:100755 100755 12d7e4e5acb0b7cde9aebc9ad09216b93d0500f8 987eeae91b5f837231a360b7f9cb7bbe5d7a1114 M      bmwqemu.pm
:100644 100644 e868ca4569da37914d7278feae9fe91c7bb8b816 4b73ffacdf64e7d032708b400bb96dc02783c92f M      commands.pm
:100644 100644 f6f9c96520bb7c61b5a9f2b3b278fb52359e6f9b c5ab8fad9aa626c609a90f040fcd377dafab1b3f M      cpanfile
:100755 100755 18024a2a366aa992437da4152c8eaecb69405214 e5622ab375ae3726b4ddd84391fbb614e4867b31 M      isotovideo
:100755 100755 1c1bb217d08cab435390ad9b15f531912281bce1 ef4a3900187567da7649b166513d93faf9a78817 M      testapi.pm

It is breaking the use of "confess" within consoles/s3270.pm

https://github.com/os-autoinst/os-autoinst/pull/510 and https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1447 should fix it however the real reason why "confess" fails is unknown.

Actions #13

Updated by coolo almost 8 years ago

But you do realize confess is meant to be fatal, right?

Actions #14

Updated by okurz almost 8 years ago

yes, I know. To be more exact: "confess" dies "within itself" so not as it should, with a backtrace and then die but I guess within some kind of overwritten custom die handler or something like that.

Actions #15

Updated by nadvornik almost 8 years ago

Anyway, the intended behavior is that "confess" should call "die_handler" in baseclass.pm, the handler should shut down the backend correctly.

Actions #16

Updated by mgriessmeier almost 8 years ago

still sporadicly failing - retry seems not to work properly
see e.g. https://openqa.suse.de/tests/439591 or https://openqa.suse.de/tests/439577

Actions #17

Updated by okurz almost 8 years ago

  • Related to action #12276: [s390] can timeout waiting for "CMS" added
Actions #18

Updated by okurz almost 8 years ago

  • Description updated (diff)

As described in action #12276 a workaround in the tests has been applied to retry on errors on either download errors or s3270 related errors: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/1463

"QABOOT" has been simplified by removing the "random sleep time" to prevent download issues. The assumption for the sleep was that too many jobs started at the same time could cause some timeout issue. Regardless if this is true or not the issue is not fixed by that. That is why the retry is the better behavior.

It seems another issue (or maybe the real problem?) why we did not see any backtrace on "die" in the logfile is that the die-handler catches the exception but silently forwards it to the web UI showing the "wait_serial"-like failure windows including all details and the backtrace. https://github.com/os-autoinst/os-autoinst/pull/513 prevents this confusion by outputting any message, e.g. backtrace, to the logfile, first. Grepping logfiles of multiple jobs is also much easier than looking at the web UI.

Now no sporadic timeout on "#cp … cms" or downloading kernel+initrd should make the tests fail. However, the backend implementation of the s3270 console is still lacking. It could be we introduced a regression or it was not completely stable after all. mgriessmeier proposes to just replace "expect_s3270" calls by "assert_screen" on the x3270 window. okurz proposes to improve the backend console properly (needs better understanding how it really works by okurz+mgriessmeier).

Actions #19

Updated by okurz almost 8 years ago

@mgriessmeier: Take a look at this failure, seems slightly different but new: https://openqa.suse.de/tests/442807/file/autoinst-log.txt
It should from now on also be catched by the retry-eval but nevertheless we might want to understand this better.

Actions #20

Updated by okurz almost 8 years ago

  • Description updated (diff)

The output for the previous test is

 '4145152 bytes transferred.                                                      ',
'INPUT                                                                           ',
'4153344 bytes transferred.                                                      ',
'4161536 bytes transferred.                                                      ',
'Server not responding, closing connection.                                      ',
'Abnormal inter-VM communication condition                                       ',
'Abnormal inter-VM communication condition                                       ',
'>>>ABOR                                                      

so the download aborts with a new warning from probably the FTP call but this should be catched now from our high-level retry in the test.

https://openqa.suse.de/tests/443270 is failing in the "Loading installation system" step.

Seen in both is that the line "INPUT" is typed too early before the xedit mode was encountered. Then the content that should go into the file edited is instead typed too early and lost. That is why the next step "Loading installation system" is never reached.

EDIT: This error was introduced with os-autoinst#d7b93259, fixed with next PR

Actions #21

Updated by okurz almost 8 years ago

  • Subject changed from [s390] sporadic fail in bootloader_s390 downloading kernel and initrd to [s390] sporadic fail in bootloader_s390, e.g. downloading kernel and initrd
  • Description updated (diff)

After a complete code walkthrough with mgriessmeier, PR done with mgriessmeier: https://github.com/os-autoinst/os-autoinst/pull/514

The most important change is basically:

-                warn "status line matches neither buffer_ready nor buffer_full:\n" . Dumper($result) . $status_line;
+                confess "expect_3270: timed out waiting for 'expected_status'.\n" . "  waiting for ${\Dumper \%arg}\n" . "  last output:\n" . Dumper($result) . $status_line;

i.e. a fatal error as in before which can be catched from the test code with "eval". Previously we exchanged an "confess" by "warn" on the same location as we had the feeling that this line is part of expected behaviour which was right considering that this point is reached when test code is looking for certain text in the status line which does not show up in the specified timeout. We were wrong to change it to a non-fatal warn as then this situation could not be detected anymore from the test code and therefore not properly handled. The confess message was probably implemented a long time ago and not improved when the functionality to also search for custom status line entries was introduced, i.e. not only "RUNNING" or "MORE..."

The correct behaviour from the test point of view is as implemented with the latest change in the test, i.e. retry on network errors.

Therefore rejecting H3 and accepting H2.

Actions #22

Updated by okurz almost 8 years ago

  • Status changed from In Progress to Feedback

new backend is deployed, let's wait some more tests to see this does not come back. Also review subtasks before closing

EDIT: New backend was not deployed on all workers until today (2016-06-20), it seems. It should work now.

Actions #23

Updated by okurz almost 8 years ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF