action #60176

[kernel][s390x] tests look for login prompt just after they are already logged in

Added by MDoucha 3 months ago. Updated 3 months ago.

Status:NewStart date:22/11/2019
Priority:NormalDue date:
Assignee:cfconrad% Done:

0%

Category:Bugs in existing tests
Target version:-
Difficulty:
Duration:

Description

Since 2019-11-20 around 09:50, all LTP install jobs running on grenache/s390-kvm-sle12 are timing out while waiting for login prompt.
SLE-12SP2: https://openqa.suse.de/tests/3610367#step/install_ltp/23
SLE-12SP4: https://openqa.suse.de/tests/3615783#step/install_ltp/23
SLE-12SP5: https://openqa.suse.de/tests/3615915#step/install_ltp/23

The login prompt appears on serial console shortly after wait_serial times out: https://openqa.suse.de/tests/3610367#step/install_ltp/27

SLE-15GA and SLE-15SP1 jobs run fine, most likely because they use zkvm workers.


Related issues

Blocks openQA Tests - action #60115: [sle][functional][u] test fails in sshd - login prompt do... Blocked 21/11/2019

History

#1 Updated by okurz 3 months ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Subject changed from s390 performance issue on grenache to s390x tests look for login prompt just after they are already logged in
  • Category set to Bugs in existing tests
  • Priority changed from Normal to High

The examples you mentioned show a completely different problem than "performance issues". One can see in https://openqa.suse.de/tests/3615783#step/install_ltp/18 the login happening, then the root console is detected, then it is detected again and then the test looks for login prompt. The timestamp you mentioned is two hours after our weekly deployment so it could very well be os-autoinst or openQA changes but also it could be tests. Please rule out any test impact first.

#2 Updated by MDoucha 3 months ago

okurz wrote:

The examples you mentioned show a completely different problem than "performance issues". One can see in https://openqa.suse.de/tests/3615783#step/install_ltp/18 the login happening, then the root console is detected, then it is detected again and then the test looks for login prompt. The timestamp you mentioned is two hours after our weekly deployment so it could very well be os-autoinst or openQA changes but also it could be tests. Please rule out any test impact first.

Here's a successful run doing the exact same sequence of operations:
1. Wait for login prompt on VNC console
2. Log in on VNC console
3. Switch to SSH console and wait for login prompt
4. Log in on SSH console
https://openqa.suse.de/tests/3609778#step/install_ltp/23

This job ran on 2019-11-20 01:46, about 8 hours before the first failure.

#3 Updated by MDoucha 3 months ago

Looking at os-autoinst changelog, this looks like the most likely culprit: https://github.com/os-autoinst/os-autoinst/pull/1166

The easiest way to check is downgrading os-autoinst on grenache and restarting one of the failed install_ltp jobs.

#4 Updated by szarate 3 months ago

  • Related to action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal added

#5 Updated by MDoucha 3 months ago

The same problem also affects openqaworker5/s390-kvm-sle15:
https://openqa.suse.de/tests/3611061#step/install_ltp/24

Last successful job ran on 2019-11-20 03:59 (~6 hours before first failure) for the exact same build with the exact same repo hash. The job was restarted because some additional incident IDs were added to job settings. So to sum it up:
* multiple worker machines affected
* different test results for the same SLE build before and after OpenQA deployment
* multiple successful runs of install_ltp test module on s390-kvm-* since the last change: https://github.com/os-autoinst/os-autoinst-distri-opensuse/commits/master/tests/kernel/install_ltp.pm

#6 Updated by mkittler 3 months ago

We could revert https://github.com/os-autoinst/os-autoinst/pull/1166 on one of the production workers to check whether this works again.

#7 Updated by pvorel 3 months ago

@mkittler I'll test reverted PR#166 on my worker, which uses s390zp13.suse.de.

#8 Updated by pvorel 3 months ago

OK, in the end cfconrad is solving this.

#9 Updated by jlausuch 3 months ago

  • Assignee set to cfconrad

#10 Updated by cfconrad 3 months ago

Finally I'm able to reproduce it on fromm.arch.suse.de.
And unfortunately it is related to PR#1166.

With PR#1166 http://fromm.arch.suse.de/tests/293
Without PR#1166 http://fromm.arch.suse.de/tests/294

The problem seems that somehow wait_until() didn't get the login prompt.

I would like to investigate a bit more, if it is ok?!

#11 Updated by cfconrad 3 months ago

I noticed the following log message in https://openqa.suse.de/tests/3615915/file/autoinst-log.txt

[2019-11-20T10:19:57.708 CET] [debug] [run_ssh_cmd(grep -q '^CONSOLE_EXIT_FJl5bTJGaRvcUiaF:' /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF)] stderr:
grep: /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF: No such file or directory

With PR#1166 the timing changed, as we do not create a new SSH connection within each run_ssh_cmd() call.

Might be a fix: https://github.com/os-autoinst/os-autoinst/pull/1281

#12 Updated by okurz 3 months ago

PR merged. For this ticket though please keep in mind that I still think the logic seems to be flawed – could also be related to the backend – because there are multiple, different checks for the login prompt, the bash prompt and login prompt again which only matches by accident.

#13 Updated by mkittler 3 months ago

I'm wondering what this 'script' command actually does and why it isn't blocking until it is done. The PR looks like it might work but also looks more like a workaround than a real fix.

#14 Updated by cfconrad 3 months ago

@mkittler this script command is started and works like a interception between STDIN/STDOUT and virsh console while logging everything into logfile.
(The idea of this program comes from teaching students, so they can record there shell session to remember/evaluate what they did.)

It was introduced by https://github.com/os-autoinst/os-autoinst/commit/fb36b1683a42afc9421eb52c9834057fac8f45fc maybe @pvorel can say a bit more.

My assumption from this PR is, that script does open(logfile) after setting up all other input/output. That's why I assume, if the logfile exists, everything is ready to receive data.

#15 Updated by pvorel 3 months ago

script was added in https://github.com/os-autoinst/os-autoinst/pull/1109, in commit https://github.com/os-autoinst/os-autoinst/pull/1109/commits/fb36b1683a42afc9421eb52c9834057fac8f45fc

script -f $log just save output into log file. It's a hack to log every command output on SUT.

my $marker = "CONSOLE_EXIT_" . get_required_var('JOBTOKEN') . ":"
$cmd_full = "script -f $log -c '$cmd; echo \"$marker \$?\"'";

It's logging into /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF ($self->serial_terminal_log_file()),
$marker is used to detect exit code of $cmd, which is for s390x 'virsh console'.

#16 Updated by pvorel 3 months ago

Sorry for useless comment, I overlooked Clement's comment 15 so I more or less repeat it.
Do you need any help with the ticket?

#17 Updated by pvorel 3 months ago

okurz wrote:

PR merged. For this ticket though please keep in mind that I still think the logic seems to be flawed – could also be related to the backend – because there are multiple, different checks for the login prompt, the bash prompt and login prompt again which only matches by accident.

Based on chat with cfconrad: "http://fromm.arch.suse.de/tests/298#step/install_ltp/18 looks for password prompt, good, http://fromm.arch.suse.de/tests/298#step/install_ltp/19 looks for "text-logged-in-root", also good. The next step looking for "root-console" in http://fromm.arch.suse.de/tests/298#step/install_ltp/20 is already redundant. http://fromm.arch.suse.de/tests/298#step/install_ltp/21 checks the same again and http://fromm.arch.suse.de/tests/298#step/install_ltp/22 looks for "linux-login" which is wrong, because we are obviously already logged in. However, by accident the needle matches for what should be only a "login prompt", not the logged in bash prompt"

All this is caused by test end communication to backend API, code from lib/susedistribution.pm and lib/opensusebasetest.pm:

everything before "ssh port open" "check for port 22 on 10.161.145.21 successful" from http://fromm.arch.suse.de/tests/298#step/install_ltp/16 is call wait_boot()
from lib/opensusebasetest.pm called from install_ltp.pm

"which tput 2>&1 && PS1="[$(tput bold 2; ..." from http://fromm.arch.suse.de/tests/298#step/install_ltp/20 is set_standard_prompt(), which is called by activate_console(), called by $self->select_serial_terminal in install_ltp.pm (=> select_console($console)).

=> there is still a lot of room for improvement of opensuse/SLE inicialization in lib/susedistribution.pm.

#18 Updated by okurz 3 months ago

well I did not blame an individual test module and as I stated it could also be that there is some code in the backend not only the test code. I do not recall seeing this duplication in other test flows though so I think at least "install_ltp" is related.

I can see that for example in sle-12-SP5-Server-DVD-s390x-default@s390x-kvm-sle12 there is also a slight duplication, e.g. in https://openqa.suse.de/tests/3576395#step/system_prepare/4 where we check for the root console again whoever it is not checking for the login prompt again.

#19 Updated by szarate 3 months ago

  • Subject changed from s390x tests look for login prompt just after they are already logged in to [kernel] s390x tests look for login prompt just after they are already logged in

#20 Updated by pvorel 3 months ago

  • Subject changed from [kernel] s390x tests look for login prompt just after they are already logged in to [kernel][s390x] tests look for login prompt just after they are already logged in

#21 Updated by pvorel 3 months ago

okurz wrote:

well I did not blame an individual test module and as I stated it could also be that there is some code in the backend not only the test code. I do not recall seeing this duplication in other test flows though so I think at least "install_ltp" is related.

Maybe wait_boot() or select_console() is not needed to be called in install_ltp for s390x. The latter could be solved in select_serial_terminal().
Please if you create some ticket about it, let me know.

#22 Updated by cfconrad 3 months ago

  • Priority changed from High to Normal

Lowered the prio, as we now only try to improve certain login calls.

#23 Updated by mgriessmeier about 1 month ago

  • Related to deleted (action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal )

#24 Updated by mgriessmeier about 1 month ago

  • Blocks action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal added

Also available in: Atom PDF