action #30613
closed[functional][y][hard] script_output might include extra serial output
Added by nicksinger over 6 years ago. Updated about 6 years ago.
0%
Description
Problem¶
From time to time, script_output
does not only contain the actual output of a script but also other output from the system on the serial output. This can result in random strings in the middle of scripts.
Example¶
https://openqa.suse.de/tests/1403728#step/btrfs_send_receive/48 is a nice example of this happening.
The script generates two time the same hash (serial0.txt -> d62eb43dbcd156535e2444de5fa62d8c) but then fails to compare them because in the meantime there was also the login-prompt printed to the serial console.
This then gets picked up as output from the script which is not the expected behavior.
Expected result¶
The return value from script_output
does only contain the output of the script. No additional serial output.
Suggestions¶
- disable the getty service on the serial console which seems to cause most if not all of the problems now and confirm we are not relying on the login session on the serial port and potentially extend the documentation of
script_output
accordingly - Alternative: let
script_output
use a second channel to communicate its result back which is only used by os-autoinst and not shared with the system-logging. E.g. a second serial connection.
Updated by nicksinger over 6 years ago
- Blocks action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added
Updated by okurz over 6 years ago
- Category set to Feature requests
I think there was something that mudler was working on but could not find it right now. For now script_output works as designed: It does not promise to return only what was expected so the test code needs to adapt correspondingly. We did this successfully in the past and I think it should be possible in the case you linked as well.
Updated by okurz over 6 years ago
- Blocks deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)
Updated by okurz over 6 years ago
- Related to action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added
Updated by nicksinger over 6 years ago
Several reasons to object:
- Hard to review
- Confirms the general opinion directed to openQA: "it's not reliable after all"
- Undocumented behavior
- Wrongly documented: https://github.com/os-autoinst/openQA/blob/master/docs/WritingTests.asciidoc#using-text-consoles-and-the-serial-terminal ("However you may find it more convenient, readable or reliable to do it in the Perl test module.")
Setting it back as blocker.
Updated by nicksinger over 6 years ago
- Related to deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)
Updated by nicksinger over 6 years ago
- Blocks action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added
Updated by coolo over 6 years ago
- Subject changed from script_output generates unreliable script outputs to script_output might include extra serial output
- Priority changed from High to Normal
- Target version set to Ready
Updated by okurz over 6 years ago
An alternative to the proposed solution using a dedicated serial line could be what we are already doing in test workarounds, e.g. see https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4327/files , that is
using special markers in the output that is looked for in the serial port output, e.g.
sub script_output_safe {
my ($cmd, %args) = @_;
my $out = script_output "echo '>>>'$($cmd)'<<<'";
my $expected_output;
if ($output =~ />>>(?<expected_output>.+)<<</) {
$expected_output = $+{expected_output};
}
return $expected_output;
}
or maybe even more simple:
sub script_output_safe {
my ($cmd, %args) = @_;
my $out = script_output "echo '>>>'$($cmd)'<<<'";
return $+ if $output =~ />>>(?<expected_output>.+)<<</;
}
Updated by okurz over 6 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Updated by JERiveraMoya over 6 years ago
- Related to action #26112: [sle][functional] test fails in zypper_lifecycle - bash script fails added
Updated by riafarov over 6 years ago
- Blocks deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)
Updated by riafarov over 6 years ago
- Related to action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added
Updated by michalnowak over 6 years ago
This one just blows my mind: https://openqa.suse.de/tests/1458433#step/apache_nss/37
# wait_serial expected: <html><h2>HelloLinux</h2></html>
# Result:
<html><h2>HelloLinux</h2></html>
Updated by riafarov over 6 years ago
So, unfortunately now we have even more complicated things on xen pv: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/29 and https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/42
For the first one, we can see output on the screen: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/27 but when we got it on serial device: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/28 letter 'e' is missing. So we it's not extra serial output, but seems that we cannot rely on serial device on xen pv.
Same happened here: https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/41 , but then here https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/42 where b is missing
Updated by michalnowak over 6 years ago
Both of those cases are on Hyper-V not Xen PV. And that's kinda expected on Hyper-V, because on Hyper-V serial console is inherently unreliable, it's leaking characters, especially for strings longer that 16 characters. zypper_lifecycle
is written in such a way that workarounding it for Hyper-V means to re-write it. I'll just disable it there: https://progress.opensuse.org/issues/32029.
Updated by rpalethorpe over 6 years ago
This is solved by using the Serial Terminal Console (VIRTIO_CONSOLE) which uses a separate serial port in QEMU, but it is an entirely different mode of communicating. See the new docs here: https://github.com/os-autoinst/openQA/pull/1586
It would be quite reasonable to refactor the VNC console code so that it uses the virtio_screen.pm code to read emulated serial terminal output (serial0.txt). Then when the virtio serial port is available it could use that instead of the emulated serial port (where the kernel log usually goes). So input would still be sent over VNC, but the output would be sent to a separate serial port as suggested by Nick and would be processed more efficiently. We could also modify wait_serial so that the user can 'hint' which serial port they want, because sometimes they may want to read something sent to the kernel log.
After that is done, it would also be possible to send text input over the serial terminal console (the virtio port in QEMU), but have it graphically displayed in VNC by running some software on the SUT (possibly just screen). This would make VNC almost as fast as a pure serial terminal, but still allow needles to be used, which would be very useful with TUIs like Yast. Although it might be better to contribute to VNC by implementing copy & paste properly.
This should also apply to other consoles which are using SSH, IPMI or physical serial port(s).
Currently I am refactoring the script_output code to move it to distribution.pm to help fix a race condition.
Updated by riafarov over 6 years ago
- Is duplicate of action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum added
Updated by riafarov over 6 years ago
- Is duplicate of deleted (action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum)
Updated by riafarov over 6 years ago
- Related to action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum added
Updated by okurz over 6 years ago
- Status changed from In Progress to Feedback
So my proposed change is live and deployed but does not prevent the problems as intended. In https://openqa.suse.de/tests/1483794/file/serial0.txt we can see
t9vXJ
c2066953541fa467dac88003fb3f9896 -
SCRIPT_FINISHEDt9vXJ-0-
showing the begin and end marker
but then we have
Welcome to SUSE Linux Enterprise Server 15 Beta7 (ppc64le) - Kernel 4.12.14-12.3-default (hvc0).
eth0: 10.0.2.15 fec0::14fb:1a86:faa9:82c4
susetest login: c2066953541fa467dac88003fb3f9896 -
SCRIPT_FINISHEDt9vXJ-0-
@rpalethorpe did some cool changes. How can we reevaluate the current situation? Everyone that observes problems in script_output please mention URLs here and describe the symptoms.
Updated by rpalethorpe over 6 years ago
Even after my changes to script_output there still appears to be some issue on ARM with cat: https://openqa.suse.de/tests/1534756. We should see 'cat /tmp/ver_linux_before.txt' after calling 'cat - > /tmp/scriptORQ2Z.sh; echo ORQ2Z-$?-', but it is totally missing. There is probably some race condition here. Possibly we could wait for the SUT to echo the new line character.
Updated by okurz over 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: textmode+role_textmode
https://openqa.suse.de/tests/1570712
Updated by okurz over 6 years ago
- Related to action #33958: [sle][functional][u][fast] test fails in zypper_lifecycle - 'end of support' line not found added
Updated by okurz over 6 years ago
Another idea I am having: Why not just disable the getty service on the serial console which seems to cause most if not all of the problems now? We are not making use of it during the normal tests anyway, right? I would not consider it a workaround because you can say that script_output
currently expects that the serial terminal used is expecting it to be reserved and not shared. We can extend the script_output documentation of course.
Updated by okurz over 6 years ago
- Related to action #29637: [tools][kernel] test suddenly print login words -> script_output used wrong added
Updated by okurz over 6 years ago
- Subject changed from script_output might include extra serial output to [functional][u]script_output might include extra serial output
- Due date set to 2018-05-22
- Status changed from Feedback to Workable
- Assignee deleted (
okurz) - Target version changed from Ready to Milestone 16
see #30613#note-28
Updated by riafarov over 6 years ago
- Subject changed from [functional][u]script_output might include extra serial output to [functional][u][hard] script_output might include extra serial output
NOTE: ticket includes just trying out solution from comment #28 and not completely new solution.
Updated by okurz over 6 years ago
- Due date changed from 2018-05-22 to 2018-06-05
I don't see it being done this sprint, let's delay it for one sprint.
Updated by SLindoMansilla over 6 years ago
- Due date changed from 2018-06-05 to 2018-06-19
Not enough capacity during sprint 18
Updated by okurz over 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: gnome+proxy_SCC+allmodules
https://openqa.suse.de/tests/1748781
Updated by okurz over 6 years ago
- Target version changed from Milestone 16 to Milestone 17
we are in M17 by now
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 17
Updated by mgriessmeier over 6 years ago
- Subject changed from [functional][u][hard] script_output might include extra serial output to [functional][y][hard] script_output might include extra serial output
- Due date changed from 2018-06-19 to 2018-07-03
revisit in planning to help everyone understand what's the next step to do
Updated by okurz over 6 years ago
- Description updated (diff)
So we had been shifting this ticket around a lot :) Let's try one more sprint on [y] otherwise we shift to "future".
Updated by riafarov over 6 years ago
- Status changed from Workable to In Progress
- Assignee set to riafarov
Updated by riafarov over 6 years ago
- Status changed from In Progress to Resolved
So idea with getty didn't work well, getty is not running against serial console and all what we see is duplicated output which we get for other ttys. So, I guess the only way to solve this issue completely is to use second serial device which will be dedicated for script_output.
@okurz, I suggest removing this ticket from the sprint or we can resolve it. Feel free to reschedule. I also believe we have mitigated issue already with workarounds, but it's still is the reason of some sporadic failures. So we can consider applying second suggestion, which will be tricky for some backends (e.g. ipmi).
Updated by riafarov over 6 years ago
- Due date deleted (
2018-07-03) - Status changed from Resolved to Workable
- Assignee deleted (
riafarov)
Updated by okurz over 6 years ago
riafarov wrote:
So idea with getty didn't work well, getty is not running against serial console and all what we see is duplicated output which we get for other ttys
Can you reference tests you ran that support this expectation? I do not understand how "duplicated output for other ttys" could show up on the serial port.
I failed to find a recently failing job on this with a quick search. The one mentioned in #30613#note-34 is actually not this ticket, the failure here is not related to "unexpected serial output" so please do not label with a progress ticket. It looks like either the lifecycle data has not been updated or we need to update the test module to cope with EOL reports on the legacy module. I suggest to clarify in a bug report.
Updated by riafarov over 6 years ago
@okurz As you may remember why original idea didn't work, because when we set tty boot parameters we have, that serial device also is used as /dev/console and hence gets everything we also see on tty we use.
As for runs, getty service is started for each tty we activate e.g. getty@tty1, but there is no getty service started against our serial console (ttyS0 on qemu). So, please, let me know if you meant something else with the idea of disabling getty.
Updated by okurz over 6 years ago
hm, if there is not getty running on $serialdev what provides the login prompt then? Or is it really because of /dev/console? I doubt it. Btw, have we ever seen it on x86_64 or only aarch64+ppc64le?
Updated by riafarov over 6 years ago
- Assignee set to riafarov
It mostly affects ppc64. We get login prompt for tty which we select, and we don't switch to ttyS0. Let me double check it. I've taken a look on my runs and there command failed.
Updated by riafarov over 6 years ago
So, here is the run I did: http://g226.suse.de/tests/1997#step/consoletest_setup/27 but as you can see on serial, it doesn't really help for ppc64le and we still get login prompt on serial:
http://g226.suse.de/tests/1997/file/serial0.txt
For x86 I got same number of "Welcome" messages on serial as without stopped getty@ttyS0, so my quick guess is that it's related to /dev/console, but obviously I can be wrong.
Updated by pvorel over 6 years ago
riafarov wrote:
It mostly affects ppc64. We get login prompt for tty which we select, and we don't switch to ttyS0. Let me double check it. I've taken a look on my runs and there command failed.
As you talk about ppc64, ttyS0
and /dev/console
, this is probably not related, but just in case you don't know that:
So on ppc64 it's hvc0
instead of ttyS0
.
There is actually comment in this ticket proving this:
Welcome to SUSE Linux Enterprise Server 15 Beta7 (ppc64le) - Kernel 4.12.14-12.3-default (hvc0).
OT: We had to do some quirks for LTP on ppc64 running with VIRTIO_CONSOLE we had to add hvc1
(#18980 https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L259).
Updated by riafarov over 6 years ago
- Status changed from Workable to Feedback
Updated by pvorel over 6 years ago
FYI: We suffer occasional problems with LTP since this commit:
b3ee11ed LTP: Merge ltp_setup_networking.pm into install_ltp.pm
https://openqa.opensuse.org/tests/698252#step/install_ltp/141
AFAIK we never had any problem on x86_64 before. I'll test your PR.
Updated by okurz over 6 years ago
But this LTP is not even using the serial device but the virtio based serial terminal. I hardly doubt that is related.
Updated by riafarov over 6 years ago
@pvorel Hmm, PR I've created hasn't been yet merged. Can it be related to workers performance?
Updated by riafarov over 6 years ago
- Due date changed from 2018-07-03 to 2018-07-31
Updated by pvorel over 6 years ago
okurz wrote:
But this LTP is not even using the serial device but the virtio based serial terminal. I hardly doubt that is related.
OK, I'm sorry for unrelated comment.
riafarov wrote:
@pvorel Hmm, PR I've created hasn't been yet merged. Can it be related to workers performance?
I have no idea.
Updated by okurz over 6 years ago
So now I wonder if we still need any workarounds, I proposed https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5379 to remove workarounds. Would you like to test the PR?
Updated by riafarov about 6 years ago
@okurz, we have discussed a problem of changing scope of the ticket, in this case I would prefer resolving this one and having separate ticket for removing other workarounds. On top, we have only mitigated an issue, so other workarounds still may be useful.
Updated by riafarov about 6 years ago
- Related to action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore added
Updated by mgriessmeier about 6 years ago
- Status changed from Resolved to In Progress
This needs to be rediscussed, since it broke s390 and svirt reboots due to masking the serialdev service in the creation job and not unmasking it again.
see e.g. https://openqa.suse.de/tests/1838736
we've reverted the PR with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5427/
Updated by okurz about 6 years ago
- Target version changed from Milestone 17 to Milestone 18
@riafarov feel free to unassign after we reopened now but please restate your explanation from the revert PR why we broke it
Updated by okurz about 6 years ago
- Related to action #37027: [tw][desktop] Update create_hdd_* to make sure right permissions added
Updated by okurz about 6 years ago
- Related to action #38480: [functional][u][fast][qam] job mau-extratests fails in new tests update_alternatives, aplay, openssl_alpn, curl_ipv6, wget_ipv6, gpg, shells added
Updated by riafarov about 6 years ago
- Due date changed from 2018-07-31 to 2018-08-14
E.g. split this part to ensure_serial_console_permissions.
Updated by riafarov about 6 years ago
- Related to deleted (action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore)
Updated by riafarov about 6 years ago
- Blocks action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore added
Updated by riafarov about 6 years ago
- Status changed from In Progress to Feedback
We are waiting when PR to split consoletest_setup will get merged and then make a decision where to add part to stop service. As mentioned above, we should have it enabled when reboot on svirt and zVM, as read serial to detect grub, welcome screens.
Updated by riafarov about 6 years ago
Updated by riafarov about 6 years ago
- Related to action #39665: [functional][u][ipmi] test fails in system_prepare, /dev/sshserial is used instead of /dev/ttyS1 added
Updated by riafarov about 6 years ago
- Due date changed from 2018-08-14 to 2018-08-28
Updated by riafarov about 6 years ago
Broke this one: https://openqa.suse.de/tests/1936687#step/consoletest_setup/29
Updated by riafarov about 6 years ago
- Status changed from Feedback to Resolved
Failure is not related to the change, but to the worker performance, we have quite some sporadic failures on it. Here is the run with less jobs running on it: https://openqa.suse.de/tests/1945118#step/consoletest_setup/19
This issues will be addressed in separate tickets.
Updated by cfconrad about 6 years ago
Created a new ticket [1] for the cases where script_output() fail in "cat -" and never return to prompt.
Which was mentioned in note-23 [2] but seems to be a different issue.
[1] https://progress.opensuse.org/issues/40913
[2] https://progress.opensuse.org/issues/30613#note-23