Project

General

Profile

Actions

action #60176

closed

[kernel][s390x] tests look for login prompt just after they are already logged in

Added by MDoucha over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2019-11-22
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Since 2019-11-20 around 09:50, all LTP install jobs running on grenache/s390-kvm-sle12 are timing out while waiting for login prompt.
SLE-12SP2: https://openqa.suse.de/tests/3610367#step/install_ltp/23
SLE-12SP4: https://openqa.suse.de/tests/3615783#step/install_ltp/23
SLE-12SP5: https://openqa.suse.de/tests/3615915#step/install_ltp/23

The login prompt appears on serial console shortly after wait_serial times out: https://openqa.suse.de/tests/3610367#step/install_ltp/27

SLE-15GA and SLE-15SP1 jobs run fine, most likely because they use zkvm workers.


Related issues 1 (0 open1 closed)

Has duplicate openQA Tests - action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal RejectedSLindoMansilla

Actions
Actions #1

Updated by okurz over 4 years ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Subject changed from s390 performance issue on grenache to s390x tests look for login prompt just after they are already logged in
  • Category set to Bugs in existing tests
  • Priority changed from Normal to High

The examples you mentioned show a completely different problem than "performance issues". One can see in https://openqa.suse.de/tests/3615783#step/install_ltp/18 the login happening, then the root console is detected, then it is detected again and then the test looks for login prompt. The timestamp you mentioned is two hours after our weekly deployment so it could very well be os-autoinst or openQA changes but also it could be tests. Please rule out any test impact first.

Actions #2

Updated by MDoucha over 4 years ago

okurz wrote:

The examples you mentioned show a completely different problem than "performance issues". One can see in https://openqa.suse.de/tests/3615783#step/install_ltp/18 the login happening, then the root console is detected, then it is detected again and then the test looks for login prompt. The timestamp you mentioned is two hours after our weekly deployment so it could very well be os-autoinst or openQA changes but also it could be tests. Please rule out any test impact first.

Here's a successful run doing the exact same sequence of operations:

  1. Wait for login prompt on VNC console
  2. Log in on VNC console
  3. Switch to SSH console and wait for login prompt
  4. Log in on SSH console https://openqa.suse.de/tests/3609778#step/install_ltp/23

This job ran on 2019-11-20 01:46, about 8 hours before the first failure.

Actions #3

Updated by MDoucha over 4 years ago

Looking at os-autoinst changelog, this looks like the most likely culprit: https://github.com/os-autoinst/os-autoinst/pull/1166

The easiest way to check is downgrading os-autoinst on grenache and restarting one of the failed install_ltp jobs.

Actions #4

Updated by szarate over 4 years ago

  • Related to action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal added
Actions #5

Updated by MDoucha over 4 years ago

The same problem also affects openqaworker5/s390-kvm-sle15:
https://openqa.suse.de/tests/3611061#step/install_ltp/24

Last successful job ran on 2019-11-20 03:59 (~6 hours before first failure) for the exact same build with the exact same repo hash. The job was restarted because some additional incident IDs were added to job settings. So to sum it up:

Actions #6

Updated by mkittler over 4 years ago

We could revert https://github.com/os-autoinst/os-autoinst/pull/1166 on one of the production workers to check whether this works again.

Actions #7

Updated by pvorel over 4 years ago

@mkittler I'll test reverted PR#166 on my worker, which uses s390zp13.suse.de.

Actions #8

Updated by pvorel over 4 years ago

OK, in the end cfconrad is solving this.

Actions #9

Updated by jlausuch over 4 years ago

  • Assignee set to cfconrad
Actions #10

Updated by cfconrad over 4 years ago

Finally I'm able to reproduce it on fromm.arch.suse.de.
And unfortunately it is related to PR#1166.

With PR#1166 http://fromm.arch.suse.de/tests/293
Without PR#1166 http://fromm.arch.suse.de/tests/294

The problem seems that somehow wait_until() didn't get the login prompt.

I would like to investigate a bit more, if it is ok?!

Actions #11

Updated by cfconrad over 4 years ago

I noticed the following log message in https://openqa.suse.de/tests/3615915/file/autoinst-log.txt

[2019-11-20T10:19:57.708 CET] [debug] [run_ssh_cmd(grep -q '^CONSOLE_EXIT_FJl5bTJGaRvcUiaF:' /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF)] stderr:
grep: /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF: No such file or directory

With PR#1166 the timing changed, as we do not create a new SSH connection within each run_ssh_cmd() call.

Might be a fix: https://github.com/os-autoinst/os-autoinst/pull/1281

Actions #12

Updated by okurz over 4 years ago

PR merged. For this ticket though please keep in mind that I still think the logic seems to be flawed – could also be related to the backend – because there are multiple, different checks for the login prompt, the bash prompt and login prompt again which only matches by accident.

Actions #13

Updated by mkittler over 4 years ago

I'm wondering what this 'script' command actually does and why it isn't blocking until it is done. The PR looks like it might work but also looks more like a workaround than a real fix.

Actions #14

Updated by cfconrad over 4 years ago

@mkittler this script command is started and works like a interception between STDIN/STDOUT and virsh console while logging everything into logfile.
(The idea of this program comes from teaching students, so they can record there shell session to remember/evaluate what they did.)

It was introduced by https://github.com/os-autoinst/os-autoinst/commit/fb36b1683a42afc9421eb52c9834057fac8f45fc maybe @pvorel can say a bit more.

My assumption from this PR is, that script does open(logfile) after setting up all other input/output. That's why I assume, if the logfile exists, everything is ready to receive data.

Actions #15

Updated by pvorel over 4 years ago

script was added in https://github.com/os-autoinst/os-autoinst/pull/1109, in commit https://github.com/os-autoinst/os-autoinst/pull/1109/commits/fb36b1683a42afc9421eb52c9834057fac8f45fc

script -f $log just save output into log file. It's a hack to log every command output on SUT.

my $marker = "CONSOLE_EXIT_" . get_required_var('JOBTOKEN') . ":"
$cmd_full = "script -f $log -c '$cmd; echo \"$marker \$?\"'";

It's logging into /tmp/serial_terminal.txt.FJl5bTJGaRvcUiaF ($self->serial_terminal_log_file()),
$marker is used to detect exit code of $cmd, which is for s390x 'virsh console'.

Actions #16

Updated by pvorel over 4 years ago

Sorry for useless comment, I overlooked Clement's comment 15 so I more or less repeat it.
Do you need any help with the ticket?

Actions #17

Updated by pvorel over 4 years ago

okurz wrote:

PR merged. For this ticket though please keep in mind that I still think the logic seems to be flawed – could also be related to the backend – because there are multiple, different checks for the login prompt, the bash prompt and login prompt again which only matches by accident.

Based on chat with cfconrad: "http://fromm.arch.suse.de/tests/298#step/install_ltp/18 looks for password prompt, good, http://fromm.arch.suse.de/tests/298#step/install_ltp/19 looks for "text-logged-in-root", also good. The next step looking for "root-console" in http://fromm.arch.suse.de/tests/298#step/install_ltp/20 is already redundant. http://fromm.arch.suse.de/tests/298#step/install_ltp/21 checks the same again and http://fromm.arch.suse.de/tests/298#step/install_ltp/22 looks for "linux-login" which is wrong, because we are obviously already logged in. However, by accident the needle matches for what should be only a "login prompt", not the logged in bash prompt"

All this is caused by test end communication to backend API, code from lib/susedistribution.pm and lib/opensusebasetest.pm:

everything before "ssh port open" "check for port 22 on 10.161.145.21 successful" from http://fromm.arch.suse.de/tests/298#step/install_ltp/16 is call wait_boot()
from lib/opensusebasetest.pm called from install_ltp.pm

"which tput 2>&1 && PS1="[$(tput bold 2; ..." from http://fromm.arch.suse.de/tests/298#step/install_ltp/20 is set_standard_prompt(), which is called by activate_console(), called by $self->select_serial_terminal in install_ltp.pm (=> select_console($console)).

=> there is still a lot of room for improvement of opensuse/SLE inicialization in lib/susedistribution.pm.

Actions #18

Updated by okurz over 4 years ago

well I did not blame an individual test module and as I stated it could also be that there is some code in the backend not only the test code. I do not recall seeing this duplication in other test flows though so I think at least "install_ltp" is related.

I can see that for example in sle-12-SP5-Server-DVD-s390x-default@s390x-kvm-sle12 there is also a slight duplication, e.g. in https://openqa.suse.de/tests/3576395#step/system_prepare/4 where we check for the root console again whoever it is not checking for the login prompt again.

Actions #19

Updated by szarate over 4 years ago

  • Subject changed from s390x tests look for login prompt just after they are already logged in to [kernel] s390x tests look for login prompt just after they are already logged in
Actions #20

Updated by pvorel over 4 years ago

  • Subject changed from [kernel] s390x tests look for login prompt just after they are already logged in to [kernel][s390x] tests look for login prompt just after they are already logged in
Actions #21

Updated by pvorel over 4 years ago

okurz wrote:

well I did not blame an individual test module and as I stated it could also be that there is some code in the backend not only the test code. I do not recall seeing this duplication in other test flows though so I think at least "install_ltp" is related.

Maybe wait_boot() or select_console() is not needed to be called in install_ltp for s390x. The latter could be solved in select_serial_terminal().
Please if you create some ticket about it, let me know.

Actions #22

Updated by cfconrad over 4 years ago

  • Priority changed from High to Normal

Lowered the prio, as we now only try to improve certain login calls.

Actions #23

Updated by mgriessmeier over 4 years ago

  • Related to deleted (action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal )
Actions #24

Updated by mgriessmeier over 4 years ago

  • Blocks action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal added
Actions #25

Updated by SLindoMansilla almost 4 years ago

  • Blocks deleted (action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal )
Actions #26

Updated by SLindoMansilla almost 4 years ago

  • Has duplicate action #60115: [sle][functional][u] test fails in sshd - login prompt doesn't show up on serial_terminal added
Actions #27

Updated by cfconrad almost 4 years ago

  • Status changed from New to Resolved

For me it looks like there where already some changes in that flow: https://openqa.suse.de/tests/4200513#step/install_ltp/19
Also, avoid looking twice for root-console-top-20200304 wouldn't gain much speed up.

I don't see a need for further investigation for now.

Actions

Also available in: Atom PDF