Project

General

Profile

action #30613

[functional][y][hard] script_output might include extra serial output

Added by nicksinger over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
SUSE QA - Milestone 18
Start date:
2018-01-22
Due date:
2018-08-28
% Done:

0%

Estimated time:
13.00 h
Difficulty:

Description

Problem

From time to time, script_output does not only contain the actual output of a script but also other output from the system on the serial output. This can result in random strings in the middle of scripts.

Example

https://openqa.suse.de/tests/1403728#step/btrfs_send_receive/48 is a nice example of this happening.
The script generates two time the same hash (serial0.txt -> d62eb43dbcd156535e2444de5fa62d8c) but then fails to compare them because in the meantime there was also the login-prompt printed to the serial console.
This then gets picked up as output from the script which is not the expected behavior.

Expected result

The return value from script_output does only contain the output of the script. No additional serial output.

Suggestions

  • disable the getty service on the serial console which seems to cause most if not all of the problems now and confirm we are not relying on the login session on the serial port and potentially extend the documentation of script_output accordingly
  • Alternative: let script_output use a second channel to communicate its result back which is only used by os-autoinst and not shared with the system-logging. E.g. a second serial connection.

Related issues

Related to openQA Tests - action #26112: [sle][functional] test fails in zypper_lifecycle - bash script failsRejected2017-10-172018-02-13

Related to openQA Tests - action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliableResolved2017-12-202018-02-13

Related to openQA Tests - action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksumResolved2017-10-17

Related to openQA Tests - action #33958: [sle][functional][u][fast] test fails in zypper_lifecycle - 'end of support' line not foundResolved2018-03-282018-04-10

Related to openQA Tests - action #29637: [tools][kernel] test suddenly print login words -> script_output used wrongResolved2017-12-20

Related to openQA Tests - action #37027: [tw][desktop] Update create_hdd_* to make sure right permissionsResolved2018-06-11

Related to openQA Tests - action #38480: [functional][u][fast][qam] job mau-extratests fails in new tests update_alternatives, aplay, openssl_alpn, curl_ipv6, wget_ipv6, gpg, shellsResolved2018-07-172018-07-31

Related to openQA Tests - action #39665: [functional][u][ipmi] test fails in system_prepare, /dev/sshserial is used instead of /dev/ttyS1Resolved2018-08-132018-09-25

Blocks openQA Tests - action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymoreResolved2018-07-162018-09-25

History

#1 Updated by nicksinger over 3 years ago

  • Blocks action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added

#2 Updated by okurz over 3 years ago

  • Category set to Feature requests

I think there was something that mudler was working on but could not find it right now. For now script_output works as designed: It does not promise to return only what was expected so the test code needs to adapt correspondingly. We did this successfully in the past and I think it should be possible in the case you linked as well.

#3 Updated by okurz over 3 years ago

  • Blocks deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)

#4 Updated by okurz over 3 years ago

  • Related to action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added

#5 Updated by nicksinger over 3 years ago

Several reasons to object:

Setting it back as blocker.

#6 Updated by nicksinger over 3 years ago

  • Related to deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)

#7 Updated by nicksinger over 3 years ago

  • Blocks action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added

#8 Updated by coolo over 3 years ago

  • Subject changed from script_output generates unreliable script outputs to script_output might include extra serial output
  • Priority changed from High to Normal
  • Target version set to Ready

#9 Updated by okurz over 3 years ago

An alternative to the proposed solution using a dedicated serial line could be what we are already doing in test workarounds, e.g. see https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4327/files , that is
using special markers in the output that is looked for in the serial port output, e.g.

sub script_output_safe {
    my ($cmd, %args) = @_;
    my $out = script_output "echo '>>>'$($cmd)'<<<'";
    my $expected_output;
    if ($output =~ />>>(?<expected_output>.+)<<</) {
        $expected_output = $+{expected_output};
    }
    return $expected_output;
}

or maybe even more simple:

sub script_output_safe {
    my ($cmd, %args) = @_;
    my $out = script_output "echo '>>>'$($cmd)'<<<'";
    return $+ if $output =~ />>>(?<expected_output>.+)<<</;
}

#10 Updated by okurz over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

#11 Updated by JERiveraMoya over 3 years ago

  • Related to action #26112: [sle][functional] test fails in zypper_lifecycle - bash script fails added

#12 Updated by riafarov over 3 years ago

  • Blocks deleted (action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable)

#13 Updated by riafarov over 3 years ago

  • Related to action #29646: [sle][functional][medium]test fails in zypper_lifecycle due to script_output is not always reliable added

#14 Updated by michalnowak over 3 years ago

This one just blows my mind: https://openqa.suse.de/tests/1458433#step/apache_nss/37

# wait_serial expected: <html><h2>HelloLinux</h2></html>

 # Result:
 <html><h2>HelloLinux</h2></html>

#16 Updated by riafarov over 3 years ago

So, unfortunately now we have even more complicated things on xen pv: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/29 and https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/42

For the first one, we can see output on the screen: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/27 but when we got it on serial device: https://openqa.suse.de/tests/1483693#step/zypper_lifecycle/28 letter 'e' is missing. So we it's not extra serial output, but seems that we cannot rely on serial device on xen pv.

Same happened here: https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/41 , but then here https://openqa.suse.de/tests/1483677#step/zypper_lifecycle/42 where b is missing

#17 Updated by michalnowak over 3 years ago

Both of those cases are on Hyper-V not Xen PV. And that's kinda expected on Hyper-V, because on Hyper-V serial console is inherently unreliable, it's leaking characters, especially for strings longer that 16 characters. zypper_lifecycle is written in such a way that workarounding it for Hyper-V means to re-write it. I'll just disable it there: https://progress.opensuse.org/issues/32029.

#18 Updated by rpalethorpe over 3 years ago

This is solved by using the Serial Terminal Console (VIRTIO_CONSOLE) which uses a separate serial port in QEMU, but it is an entirely different mode of communicating. See the new docs here: https://github.com/os-autoinst/openQA/pull/1586

It would be quite reasonable to refactor the VNC console code so that it uses the virtio_screen.pm code to read emulated serial terminal output (serial0.txt). Then when the virtio serial port is available it could use that instead of the emulated serial port (where the kernel log usually goes). So input would still be sent over VNC, but the output would be sent to a separate serial port as suggested by Nick and would be processed more efficiently. We could also modify wait_serial so that the user can 'hint' which serial port they want, because sometimes they may want to read something sent to the kernel log.

After that is done, it would also be possible to send text input over the serial terminal console (the virtio port in QEMU), but have it graphically displayed in VNC by running some software on the SUT (possibly just screen). This would make VNC almost as fast as a pure serial terminal, but still allow needles to be used, which would be very useful with TUIs like Yast. Although it might be better to contribute to VNC by implementing copy & paste properly.

This should also apply to other consoles which are using SSH, IPMI or physical serial port(s).

Currently I am refactoring the script_output code to move it to distribution.pm to help fix a race condition.

#19 Updated by riafarov over 3 years ago

  • Is duplicate of action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum added

#20 Updated by riafarov over 3 years ago

  • Is duplicate of deleted (action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum)

#21 Updated by riafarov over 3 years ago

  • Related to action #26108: [sle][functional][u][sporadic] test fails in btrfs_send_receive - checksum of snapshot does not match original checksum added

#22 Updated by okurz over 3 years ago

  • Status changed from In Progress to Feedback

So my proposed change is live and deployed but does not prevent the problems as intended. In https://openqa.suse.de/tests/1483794/file/serial0.txt we can see

t9vXJ
c2066953541fa467dac88003fb3f9896  -
SCRIPT_FINISHEDt9vXJ-0-

showing the begin and end marker

but then we have

Welcome to SUSE Linux Enterprise Server 15 Beta7 (ppc64le) - Kernel 4.12.14-12.3-default (hvc0).

eth0: 10.0.2.15 fec0::14fb:1a86:faa9:82c4


susetest login: c2066953541fa467dac88003fb3f9896  -
SCRIPT_FINISHEDt9vXJ-0-

rpalethorpe did some cool changes. How can we reevaluate the current situation? Everyone that observes problems in script_output please mention URLs here and describe the symptoms.

#23 Updated by rpalethorpe over 3 years ago

Even after my changes to script_output there still appears to be some issue on ARM with cat: https://openqa.suse.de/tests/1534756. We should see 'cat /tmp/ver_linux_before.txt' after calling 'cat - > /tmp/scriptORQ2Z.sh; echo ORQ2Z-$?-', but it is totally missing. There is probably some race condition here. Possibly we could wait for the SUT to echo the new line character.

#26 Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode+role_textmode
https://openqa.suse.de/tests/1570712

#27 Updated by okurz over 3 years ago

  • Related to action #33958: [sle][functional][u][fast] test fails in zypper_lifecycle - 'end of support' line not found added

#28 Updated by okurz over 3 years ago

Another idea I am having: Why not just disable the getty service on the serial console which seems to cause most if not all of the problems now? We are not making use of it during the normal tests anyway, right? I would not consider it a workaround because you can say that script_output currently expects that the serial terminal used is expecting it to be reserved and not shared. We can extend the script_output documentation of course.

#29 Updated by okurz over 3 years ago

  • Related to action #29637: [tools][kernel] test suddenly print login words -> script_output used wrong added

#30 Updated by okurz over 3 years ago

  • Subject changed from script_output might include extra serial output to [functional][u]script_output might include extra serial output
  • Due date set to 2018-05-22
  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)
  • Target version changed from Ready to Milestone 16

#31 Updated by riafarov over 3 years ago

  • Subject changed from [functional][u]script_output might include extra serial output to [functional][u][hard] script_output might include extra serial output

NOTE: ticket includes just trying out solution from comment #28 and not completely new solution.

#32 Updated by okurz over 3 years ago

  • Due date changed from 2018-05-22 to 2018-06-05

I don't see it being done this sprint, let's delay it for one sprint.

#33 Updated by SLindoMansilla over 3 years ago

  • Due date changed from 2018-06-05 to 2018-06-19

Not enough capacity during sprint 18

#34 Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: gnome+proxy_SCC+allmodules
https://openqa.suse.de/tests/1748781

#35 Updated by okurz over 3 years ago

  • Target version changed from Milestone 16 to Milestone 17

we are in M17 by now

#36 Updated by okurz over 3 years ago

  • Target version changed from Milestone 17 to Milestone 17

#37 Updated by mgriessmeier over 3 years ago

  • Subject changed from [functional][u][hard] script_output might include extra serial output to [functional][y][hard] script_output might include extra serial output
  • Due date changed from 2018-06-19 to 2018-07-03

revisit in planning to help everyone understand what's the next step to do

#38 Updated by okurz over 3 years ago

  • Description updated (diff)

So we had been shifting this ticket around a lot :) Let's try one more sprint on [y] otherwise we shift to "future".

#39 Updated by JERiveraMoya over 3 years ago

  • Estimated time set to 13.00 h

#40 Updated by riafarov over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to riafarov

#41 Updated by riafarov over 3 years ago

  • Status changed from In Progress to Resolved

So idea with getty didn't work well, getty is not running against serial console and all what we see is duplicated output which we get for other ttys. So, I guess the only way to solve this issue completely is to use second serial device which will be dedicated for script_output.

okurz, I suggest removing this ticket from the sprint or we can resolve it. Feel free to reschedule. I also believe we have mitigated issue already with workarounds, but it's still is the reason of some sporadic failures. So we can consider applying second suggestion, which will be tricky for some backends (e.g. ipmi).

#42 Updated by riafarov over 3 years ago

  • Due date deleted (2018-07-03)
  • Status changed from Resolved to Workable
  • Assignee deleted (riafarov)

#43 Updated by riafarov over 3 years ago

  • Due date set to 2018-07-03

#44 Updated by okurz over 3 years ago

riafarov wrote:

So idea with getty didn't work well, getty is not running against serial console and all what we see is duplicated output which we get for other ttys

Can you reference tests you ran that support this expectation? I do not understand how "duplicated output for other ttys" could show up on the serial port.

I failed to find a recently failing job on this with a quick search. The one mentioned in #30613#note-34 is actually not this ticket, the failure here is not related to "unexpected serial output" so please do not label with a progress ticket. It looks like either the lifecycle data has not been updated or we need to update the test module to cope with EOL reports on the legacy module. I suggest to clarify in a bug report.

#45 Updated by riafarov about 3 years ago

okurz As you may remember why original idea didn't work, because when we set tty boot parameters we have, that serial device also is used as /dev/console and hence gets everything we also see on tty we use.

As for runs, getty service is started for each tty we activate e.g. getty@tty1, but there is no getty service started against our serial console (ttyS0 on qemu). So, please, let me know if you meant something else with the idea of disabling getty.

#46 Updated by okurz about 3 years ago

hm, if there is not getty running on $serialdev what provides the login prompt then? Or is it really because of /dev/console? I doubt it. Btw, have we ever seen it on x86_64 or only aarch64+ppc64le?

#47 Updated by riafarov about 3 years ago

  • Assignee set to riafarov

It mostly affects ppc64. We get login prompt for tty which we select, and we don't switch to ttyS0. Let me double check it. I've taken a look on my runs and there command failed.

#48 Updated by riafarov about 3 years ago

So, here is the run I did: http://g226.suse.de/tests/1997#step/consoletest_setup/27 but as you can see on serial, it doesn't really help for ppc64le and we still get login prompt on serial:
http://g226.suse.de/tests/1997/file/serial0.txt

For x86 I got same number of "Welcome" messages on serial as without stopped getty@ttyS0, so my quick guess is that it's related to /dev/console, but obviously I can be wrong.

#49 Updated by pvorel about 3 years ago

riafarov wrote:

It mostly affects ppc64. We get login prompt for tty which we select, and we don't switch to ttyS0. Let me double check it. I've taken a look on my runs and there command failed.

As you talk about ppc64, ttyS0 and /dev/console, this is probably not related, but just in case you don't know that:
So on ppc64 it's hvc0 instead of ttyS0.

There is actually comment in this ticket proving this:
Welcome to SUSE Linux Enterprise Server 15 Beta7 (ppc64le) - Kernel 4.12.14-12.3-default (hvc0).

OT: We had to do some quirks for LTP on ppc64 running with VIRTIO_CONSOLE we had to add hvc1 (#18980 https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/kernel/install_ltp.pm#L259).

#51 Updated by pvorel about 3 years ago

FYI: We suffer occasional problems with LTP since this commit:
b3ee11ed LTP: Merge ltp_setup_networking.pm into install_ltp.pm

https://openqa.opensuse.org/tests/698252#step/install_ltp/141
AFAIK we never had any problem on x86_64 before. I'll test your PR.

#52 Updated by okurz about 3 years ago

But this LTP is not even using the serial device but the virtio based serial terminal. I hardly doubt that is related.

#53 Updated by riafarov about 3 years ago

pvorel Hmm, PR I've created hasn't been yet merged. Can it be related to workers performance?

#54 Updated by riafarov about 3 years ago

  • Due date changed from 2018-07-03 to 2018-07-31

#55 Updated by pvorel about 3 years ago

okurz wrote:

But this LTP is not even using the serial device but the virtio based serial terminal. I hardly doubt that is related.

OK, I'm sorry for unrelated comment.

riafarov wrote:

pvorel Hmm, PR I've created hasn't been yet merged. Can it be related to workers performance?

I have no idea.

#56 Updated by okurz about 3 years ago

So now I wonder if we still need any workarounds, I proposed https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5379 to remove workarounds. Would you like to test the PR?

#57 Updated by riafarov about 3 years ago

okurz, we have discussed a problem of changing scope of the ticket, in this case I would prefer resolving this one and having separate ticket for removing other workarounds. On top, we have only mitigated an issue, so other workarounds still may be useful.

#58 Updated by riafarov about 3 years ago

  • Related to action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore added

#59 Updated by riafarov about 3 years ago

  • Status changed from Feedback to Resolved

#60 Updated by mgriessmeier about 3 years ago

  • Status changed from Resolved to In Progress

This needs to be rediscussed, since it broke s390 and svirt reboots due to masking the serialdev service in the creation job and not unmasking it again.
see e.g. https://openqa.suse.de/tests/1838736

we've reverted the PR with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5427/

#61 Updated by okurz about 3 years ago

  • Target version changed from Milestone 17 to Milestone 18

riafarov feel free to unassign after we reopened now but please restate your explanation from the revert PR why we broke it

#62 Updated by okurz about 3 years ago

  • Related to action #37027: [tw][desktop] Update create_hdd_* to make sure right permissions added

#63 Updated by okurz about 3 years ago

  • Related to action #38480: [functional][u][fast][qam] job mau-extratests fails in new tests update_alternatives, aplay, openssl_alpn, curl_ipv6, wget_ipv6, gpg, shells added

#64 Updated by riafarov about 3 years ago

  • Due date changed from 2018-07-31 to 2018-08-14

E.g. split this part to ensure_serial_console_permissions.

#65 Updated by riafarov about 3 years ago

  • Related to deleted (action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore)

#66 Updated by riafarov about 3 years ago

  • Blocks action #38417: [sle][functional][u][easy] Cleanup workarounds for serial output which are not required anymore added

#67 Updated by riafarov about 3 years ago

  • Status changed from In Progress to Feedback

We are waiting when PR to split consoletest_setup will get merged and then make a decision where to add part to stop service. As mentioned above, we should have it enabled when reboot on svirt and zVM, as read serial to detect grub, welcome screens.

#69 Updated by riafarov about 3 years ago

  • Related to action #39665: [functional][u][ipmi] test fails in system_prepare, /dev/sshserial is used instead of /dev/ttyS1 added

#70 Updated by riafarov about 3 years ago

  • Due date changed from 2018-08-14 to 2018-08-28

#72 Updated by riafarov about 3 years ago

  • Status changed from Feedback to Resolved

Failure is not related to the change, but to the worker performance, we have quite some sporadic failures on it. Here is the run with less jobs running on it: https://openqa.suse.de/tests/1945118#step/consoletest_setup/19

This issues will be addressed in separate tickets.

#73 Updated by cfconrad about 3 years ago

Created a new ticket [1] for the cases where script_output() fail in "cat -" and never return to prompt.

Which was mentioned in note-23 [2] but seems to be a different issue.

[1] https://progress.opensuse.org/issues/40913
[2] https://progress.opensuse.org/issues/30613#note-23

Also available in: Atom PDF