action #12410
s390 dasdfmt fails even though command looks complete in screenshot
0%
Description
observation
In https://openqa.suse.de/tests/448619 dasdfmt appears to have finished, but wait_serial timed out. Maybe the formatting took unusually long?
steps to reproduce
TBC
problem
happened sporadically during the last week
H1. REJECTED: worker/s390-host specific
H2. can happen everywhere
H3. our recent changes in bootloader_s390 introduced some behaviour change
H4. serial output gets lost
H4.1. REJECTED: output to serial gets lost randomly -> 1000/1000 runs of assert_script_run("echo $_", 10, 'failed'); succeeded, see http://opeth.suse.de/tests/2825
H4.2. REJECTED: long timeouts cause serial output loss -> 10/10 runs of assert_script_run("sleep 900 && echo $_", 1200, 'failed'); succeeded, see http://opeth.suse.de/tests/2825 and http://opeth.suse.de/tests/2866
H4.3. UNCLEAR: serial output only gets lost when dasdfmt is called with assert_script_run -> not reproducible at all on lord.arch, so E3-1 and E4-1 may be invalid
H4.4. MOST LIKELY: iucvconn and agetty processes are not running -> we can see this in our debug output
suggestion
- check logfiles, e.g. for exact timing sequence -> wait_serial times out after 20 minutes on both occasions. From the video we can see that the actual formatting process had already finished before that
- E1-1. DONE: reproduce by calling the dasdfmt repeatedly on another host (my host (okurz)) -> done, could not reproduce in http://lord.arch/tests/1582 in 17/17 runs of full dasdfmt on personal instance
- E2-1. DONE: find out if problem only occurs on some or a single host -> found 3 different hosts with this issue
- E3-1. DONE: mgriessmeier: find old test run before we deployed new backend that shows this error -> none found
- E4-1. DONE: mgriessmeier: find s390x host with small disk (to save time) and format many times, i.e. call for-loop with the assert_script_run on dasdfmt -> could not reproduce
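The symptom discussed above (the command finishes, yet the serial wait still times out) can be modelled with a small sketch. This is a simplified model and not openQA's implementation: `wait_for_marker`, the `MARKER-<exit code>-` format, and the sample serial contents are all hypothetical. It assumes that a marker is written to the serial device after the command ends and that the test polls the serial log for it.

```python
import re
import time
from typing import Callable, Optional


def wait_for_marker(read_serial: Callable[[], str], pattern: str,
                    timeout: float, poll_interval: float = 0.01) -> Optional[str]:
    """Poll the serial buffer until `pattern` matches or `timeout` seconds pass.

    Returns the matched text, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    regex = re.compile(pattern)
    while True:
        match = regex.search(read_serial())
        if match:
            return match.group(0)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Case 1 (hypothetical data): the marker reaches the serial log, the wait succeeds.
healthy_serial = "formatting done\nMARKER-0-\n"
ok = wait_for_marker(lambda: healthy_serial, r"MARKER-\d+-", timeout=0.1)

# Case 2 (hypothetical data): the command finished, but nothing forwards the
# marker to the serial log, so the wait runs into the timeout.
lost = wait_for_marker(lambda: "", r"MARKER-\d+-", timeout=0.1)
```

In this model a longer timeout does not help in the second case, which would be consistent with a 20-minute timeout being hit although the video shows the formatting as finished.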
workaround
sporadic failure, restart the job
Related issues
History
#1
Updated by okurz over 6 years ago
also happened in https://openqa.suse.de/tests/448532
#2
Updated by okurz over 6 years ago
- Description updated (diff)
started a special test run to check the hypotheses in
http://opeth.suse.de/tests/2825
using:
# Test if randomly output to serial gets lost
for (1..1_000) {
    assert_script_run("echo $_", 10, 'failed');
}
# Test if long timeouts cause serial output loss
for (1..10) {
    assert_script_run("sleep 900 && echo $_", 1200, 'failed');
}
#3
Updated by okurz over 6 years ago
- Description updated (diff)
#4
Updated by okurz over 6 years ago
- Description updated (diff)
- Assignee changed from okurz to mgriessmeier
experiment finished in http://lord.arch/tests/1582, build timed out after 2h, 17/17 succeeded
mgriessmeier, please conduct the missing experiments
#5
Updated by okurz over 6 years ago
recent example on Build1636: https://openqa.suse.de/tests/454616 failed for the same reason, on openqaw1:1. Previous tests triggered from lord.arch couldn't reproduce this, but I have the free resource, so I am making use of it: http://lord.arch/tests/1593 and following with test "crosscheck_poo#12410@s390x-zVM-vswitch-l3"
… but it does not work: there are always problems connecting over ssh, and I don't know why.
#6
Updated by okurz over 6 years ago
I am trying once more to reproduce this, this time based on the most recent failing example in Build 1648.
Triggered as http://lord.arch/tests/1825 and 20 following.
#7
Updated by okurz over 6 years ago
- Related to action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost added
#8
Updated by okurz over 6 years ago
- Related to action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes) added
#9
Updated by mgriessmeier over 6 years ago
- Description updated (diff)
#10
Updated by mgriessmeier over 6 years ago
- Description updated (diff)
#11
Updated by okurz over 6 years ago
- Related to action #12300: [s390] can fail during formatting/wait_serial added
#12
Updated by okurz over 6 years ago
- Related to deleted (action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes))
#13
Updated by okurz over 6 years ago
- Blocks action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes) added
#14
Updated by okurz over 6 years ago
- Related to deleted (action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost)
#15
Updated by okurz over 6 years ago
- Blocked by action #12596: s390: wait serial output in "logpackages" and "consoletest_setup" is lost added
#16
Updated by mgriessmeier over 6 years ago
- Description updated (diff)
latest failing example:
https://openqa.suse.de/tests/539995
if you compare this step https://openqa.suse.de/tests/539995#step/bootloader_s390/8
to the same step of a passed job, e.g. https://openqa.suse.de/tests/538665#step/bootloader_s390/8
you see that iucvconn and agetty were somehow killed or not yet established, which results in the wait_serial issue
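The manual comparison of the two steps could also be done automatically. A minimal sketch, assuming the debug output contains a process listing with one command per line: the helper name `missing_processes` and the sample listings are hypothetical, and only the process names iucvconn and agetty come from this ticket.

```python
import os
from typing import Iterable, List


def missing_processes(ps_output: str, required: Iterable[str]) -> List[str]:
    """Return the required process names that do not appear in a process listing."""
    running = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if fields:
            # Use the basename so that '/sbin/agetty' matches 'agetty'.
            running.add(os.path.basename(fields[0]))
    return [name for name in required if name not in running]


REQUIRED = ["iucvconn", "agetty"]

# Hypothetical listings standing in for the debug output of a passed and a failed job.
passed_job = "COMMAND\nsshd\niucvconn\n/sbin/agetty\n"
failed_job = "COMMAND\nsshd\n"

missing_in_passed = missing_processes(passed_job, REQUIRED)
missing_in_failed = missing_processes(failed_job, REQUIRED)
```

A non-empty result before dasdfmt is started would indicate that the serial wait cannot succeed, so the job could fail early with a clear message instead of waiting for the timeout.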
#17
Updated by okurz over 6 years ago
- Blocks deleted (action #12452: [s390x] mysql_srv: wait_serial expects regexp, serial0.log shows the right match, but test fails (timeout too short sometimes))
#18
Updated by mgriessmeier over 6 years ago
- Status changed from New to Feedback
not seen for a long time, considering it fixed
#19
Updated by mgriessmeier over 6 years ago
- Status changed from Feedback to Resolved