action #12410

closed

s390 dasdfmt fails even though command looks complete in screenshot

Added by okurz almost 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2016-06-20
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

observation

https://openqa.suse.de/tests/448619: dasdfmt seems to be done, but wait_serial still gave up. Maybe it was taking unusually long?
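To illustrate the suspected failure mode, here is a minimal sketch of waiting for a marker with a timeout. It is not the os-autoinst implementation of wait_serial. The helper name `wait_for_marker`, the file-based polling and the polling interval are assumptions made for illustration only. The point is that the wait fails whenever the marker never reaches the log, even if the command itself has finished.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper (assumption, not part of testapi): poll $logfile until
# $regex matches its content or $timeout seconds have passed.
# Returns 1 on match, 0 on timeout.
sub wait_for_marker {
    my ($logfile, $regex, $timeout, $interval) = @_;
    $interval //= 0.1;
    my $deadline = time() + $timeout;
    while (1) {
        if (open(my $fh, '<', $logfile)) {
            local $/;    # slurp mode: read the whole file at once
            my $content = <$fh> // '';
            close($fh);
            return 1 if $content =~ $regex;
        }
        # The command may be long finished; if the marker never arrives
        # in the log, we still end up here.
        return 0 if time() >= $deadline;
        sleep($interval);
    }
}
```

If the process that forwards output to the serial log is gone, the marker is never written and the call returns 0 after the full timeout, which would match a screenshot showing a completed command next to a failed wait.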

steps to reproduce

TBC

problem

happened sporadically during the last week
H1. REJECTED: worker/s390-host specific
H2. can happen everywhere
H3. our recent changes in bootloader_s390 introduced some behaviour change
H4. serial output gets lost
H4.1. REJECTED: output to serial gets lost randomly -> 1000/1000 runs of assert_script_run("echo $_", 10, 'failed'); succeeded, see http://opeth.suse.de/tests/2825
H4.2. REJECTED: long timeouts cause serial output loss -> 10/10 runs of assert_script_run("sleep 900 && echo $_", 1200, 'failed'); succeeded, see http://opeth.suse.de/tests/2825 and http://opeth.suse.de/tests/2866
H4.3. UNCLEAR: serial output only gets lost when dasdfmt is called with assert_script_run -> not reproducible at all on lord.arch, so E3-1 and E4-1 may be invalid
H4.4. iucvconn and agetty processes are not running <- most likely, since we can see this in our debug output

suggestion

  • check logfiles, e.g. for exact timing sequence -> wait_serial times out after 20 minutes on both occasions. From the video we can see that the actual formatting process had already finished before that
  • E1-1. DONE: reproduce by calling dasdfmt repeatedly on another host (my host (okurz)) -> done, could not reproduce in http://lord.arch/tests/1582 in 17/17 runs of full dasdfmt on personal instance
  • E2-1. DONE: find out if problem only occurs on some or a single host -> found 3 different hosts with this issue
  • E3-1. DONE: @mgriessmeier: find old test run before we deployed new backend that shows this error -> none found
  • E4-1. DONE: @mgriessmeier: find s390x host with small disk (to save time) and format many times, i.e. call for-loop with the assert_script_run on dasdfmt -> could not reproduce
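The repetition experiments (E1-1, E4-1) boil down to running one command many times and counting failures. A minimal standalone sketch of that idea follows. The helper `repeat_command` is hypothetical, and `true` is only a placeholder for the real dasdfmt call, whose exact arguments are not given in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption): run $cmd $runs times and return the
# numbers of the runs that exited with a non-zero status.
sub repeat_command {
    my ($cmd, $runs) = @_;
    my @failed;
    for my $run (1 .. $runs) {
        my $status = system($cmd);
        push @failed, $run if $status != 0;
    }
    return @failed;
}

# Placeholder command; on a real s390x guest this would be the dasdfmt call.
my $runs   = 17;
my @failed = repeat_command('true', $runs);
printf "%d/%d runs succeeded\n", $runs - scalar(@failed), $runs;
```

Keeping the failing run numbers rather than a bare count makes it easier to correlate a failure with the matching point in the logs.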

workaround

The failure is sporadic; restart the job.


Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #12300: [s390] can fail during formatting/wait_serial (Resolved, okurz, 2016-05-24)
Blocked by openQA Tests - action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost (Resolved, mgriessmeier, 2016-07-04)

Actions #1

Updated by okurz almost 8 years ago

Actions #2

Updated by okurz almost 8 years ago

  • Description updated (diff)

Special test run started to check the hypotheses:
http://opeth.suse.de/tests/2825

We use:

    # Test if randomly output to serial gets lost
    for (1..1_000) {
        assert_script_run("echo $_", 10, 'failed');
    }

    # Test if long timeouts cause serial output loss
    for (1..10) {
        assert_script_run("sleep 900 && echo $_", 1200, 'failed');
    }
Actions #3

Updated by okurz almost 8 years ago

  • Description updated (diff)
Actions #4

Updated by okurz almost 8 years ago

  • Description updated (diff)
  • Assignee changed from okurz to mgriessmeier

Experiment finished in http://lord.arch/tests/1582: the build timed out after 2h, 17/17 runs succeeded.

@mgriessmeier, please conduct the missing experiments.

Actions #5

Updated by okurz almost 8 years ago

Recent example on Build1636: https://openqa.suse.de/tests/454616 failed for the same reason, on openqaw1:1. Previous tests triggered from lord.arch couldn't reproduce this, but I have the free resource, so I am making use of it: http://lord.arch/tests/1593 and following with test "crosscheck_poo#12410@s390x-zVM-vswitch-l3"
… but it does not work, there are always problems connecting over ssh and I don't know why.

Actions #6

Updated by okurz over 7 years ago

I am trying once more to reproduce this, this time based on the most recent failing example in Build 1648.
Triggered as http://lord.arch/tests/1825 and 20 following.

Actions #7

Updated by okurz over 7 years ago

  • Related to action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost added
Actions #8

Updated by okurz over 7 years ago

  • Related to action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes) added
Actions #9

Updated by mgriessmeier over 7 years ago

  • Description updated (diff)
Actions #10

Updated by mgriessmeier over 7 years ago

  • Description updated (diff)
Actions #11

Updated by okurz over 7 years ago

  • Related to action #12300: [s390] can fail during formatting/wait_serial added
Actions #12

Updated by okurz over 7 years ago

  • Related to deleted (action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes))
Actions #13

Updated by okurz over 7 years ago

  • Blocks action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes) added
Actions #14

Updated by okurz over 7 years ago

  • Related to deleted (action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost)
Actions #15

Updated by okurz over 7 years ago

  • Blocked by action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost added
Actions #16

Updated by mgriessmeier over 7 years ago

  • Description updated (diff)

latest failing example:
https://openqa.suse.de/tests/539995

If you compare this step https://openqa.suse.de/tests/539995#step/bootloader_s390/8
to the same step of a passed job, e.g. https://openqa.suse.de/tests/538665#step/bootloader_s390/8,
you see that the iucvconn and agetty processes were somehow killed or not yet established, which results in the wait_serial issue.
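A check of this kind could be automated. Below is a minimal sketch, assuming the process list is available as plain text. The helper `missing_processes` is hypothetical and the sample process list is made up for illustration, not taken from either job.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption): given the text of a process list, return
# the names from @wanted that do not appear in it as whole words.
sub missing_processes {
    my ($ps_output, @wanted) = @_;
    return grep { $ps_output !~ /\b\Q$_\E\b/ } @wanted;
}

# Made-up sample output, only to show the call.
my $ps_output = <<'EOT';
  PID TTY          TIME CMD
    1 ?        00:00:01 systemd
  812 ?        00:00:00 sshd
EOT

my @missing = missing_processes($ps_output, qw(iucvconn agetty));
print "not running: @missing\n" if @missing;
```

Running such a check before a long wait would let a test fail early with a clear message instead of timing out.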

Actions #17

Updated by okurz over 7 years ago

  • Blocks deleted (action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes))
Actions #18

Updated by mgriessmeier over 7 years ago

  • Status changed from New to Feedback

Not seen for a long time, considering it fixed.

Actions #19

Updated by mgriessmeier over 7 years ago

  • Status changed from Feedback to Resolved