Project

General

Profile

Actions

action #34003

closed

[tools] Better logging and error handling in case of remote console connections in consoles or backends, e.g. ssh

Added by okurz over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2018-03-29
Due date:
% Done:

0%

Estimated time:

Description

Motivation

See for example in gh#os-autoinst/os-autoinst-distri-opensuse#4629 we struggle a lot to find out why a select_console 'root-console' call fails for the s390x-kvm-sle12 machine using the svirt backend relying on an ssh-connection within an xterm-console within a local Xvnc server. If it fails we are mainly left with a mouse cursor on a black screen and nobody knows what is going on

Acceptance criteria

  • AC1: logs tell when a remote connection, e.g. over SSH or VNC is made
  • AC2: At least a hint is shown in logs or as thumbnail in openQA details view what could have gone wrong

Tasks

  • Research how the svirt backend conducts ssh connections using the according console
  • Add some logging with the according parameters, e.g. from where to where a connection is made
  • Add error handling and reporting in case something goes wrong
  • Take a look into other backends/consoles if maybe the same approach could be applied there

Related issues 5 (0 open5 closed)

Related to openQA Tests (public) - action #33199: [sle][functional][s390x][zkvm][u][hard] test fails in kdump_and_crash - system does not shutdown or reboot? what is happening? better output needed?Resolvedmgriessmeier2018-03-132018-04-10

Actions
Related to openQA Tests (public) - action #35641: [sle][functional][u][sporadic][mistyping] test fails in svirt_upload_assets - unchanged PS1Rejected2018-04-27

Actions
Blocks openQA Tests (public) - action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it)Resolvedmgriessmeier2018-03-132018-08-14

Actions
Blocks openQA Tests (public) - action #33865: [sles][functional][s390x][easy][y] Enable yast2_ncurses testsuite for s390xResolvedriafarov2018-03-27

Actions
Blocks openQA Tests (public) - action #36268: [sle][functional][u][hard][s390x] test fails in consoletest_finish - screen does not show desktop sessionResolvedoorlov2018-05-162018-09-11

Actions
Actions #1

Updated by okurz over 6 years ago

  • Related to action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it) added
Actions #2

Updated by okurz over 6 years ago

  • Related to action #33199: [sle][functional][s390x][zkvm][u][hard] test fails in kdump_and_crash - system does not shutdown or reboot? what is happening? better output needed? added
Actions #3

Updated by mgriessmeier over 6 years ago

  • Subject changed from [functional][u]Better logging and error handling in case of remote console connections in backends, e.g. ssh to [functional][u][medium] Better logging and error handling in case of remote console connections in backends, e.g. ssh
  • Status changed from New to Workable
Actions #4

Updated by okurz over 6 years ago

  • Related to action #34837: [sle12][functional][u][s390x-kvm] test fails in consoletest_finish while switching from console to GUI desktop, stuck on black screen added
Actions #5

Updated by mgriessmeier over 6 years ago

  • Due date changed from 2018-04-24 to 2018-05-08
  • Target version changed from Milestone 15 to Milestone 16
Actions #6

Updated by okurz over 6 years ago

  • Related to deleted (action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it))
Actions #7

Updated by okurz over 6 years ago

  • Blocks action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it) added
Actions #8

Updated by okurz over 6 years ago

  • Subject changed from [functional][u][medium] Better logging and error handling in case of remote console connections in backends, e.g. ssh to [tools] Better logging and error handling in case of remote console connections in backends, e.g. ssh
  • Due date deleted (2018-05-08)
  • Target version deleted (Milestone 16)

Dear tools-team, this feature would help QSF and hopefully other teams very much. Would you like to take a look into this please?

Actions #9

Updated by okurz over 6 years ago

  • Related to action #35641: [sle][functional][u][sporadic][mistyping] test fails in svirt_upload_assets - unchanged PS1 added
Actions #10

Updated by okurz over 6 years ago

  • Blocks action #33865: [sles][functional][s390x][easy][y] Enable yast2_ncurses testsuite for s390x added
Actions #11

Updated by coolo over 6 years ago

I don't really understand what you expect here. Agressive debug output from ssh?

Actions #12

Updated by coolo over 6 years ago

So, I did look into the issue and there doesn't seem to be a problem with ssh actually. The failure looks like this:

https://openqa.suse.de/tests/1656222#step/boot_to_desktop/21

While if the xterm started, it looks like this:
https://openqa.suse.de/tests/1563415#step/boot_to_desktop/19

What you notice in the failed log:
[2018-05-01T08:50:09.0524 CEST] [debug] <<< testapi::select_console(testapi_console='root-console')
/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
'hostname' => 'localhost',
'port' => 35180,
'ikvm' => 0
}
IceWM: using /var/lib/empty/.icewm for private configuration files
11841
[2018-05-01T08:50:10.0777 CEST] [debug] activate_console, console: root-console, type: console
[2018-05-01T08:50:10.0777 CEST] [debug] backend s390x || zkvm
[2018-05-01T08:50:10.0778 CEST] [debug] /var/lib/openqa/cache/tests/sle/tests/boot/boot_to_desktop.pm:27 called opensusebasetest::wait_boot
[2018-05-01T08:50:10.0778 CEST] [debug] <<< testapi::assert_screen(mustmatch='password-prompt', timeout=30)
[2018-05-01T08:50:10.0833 CEST] [debug] MATCH(password-prompt-20141126:0.00)
[2018-05-01T08:50:10.0838 CEST] [info] Collected unknown process with pid 11841 and exit status: 0

Note that 11841 is the xterm/Xvnc pid - I don't know why it crashes/stops, but it has nothing to do with ssh

Actions #13

Updated by okurz over 6 years ago

coolo wrote:

[…]
Note that 11841 is the xterm/Xvnc pid - I don't know why it crashes/stops, but it has nothing to do with ssh

Well, then I consider the "Xvnc has vanished" as part of the story.

Do the Acceptance criteria make sense to you?

Take a look into a standard test with "remote backend", e.g. default@s390x-kvm. Looking for select_console I see there blocks like

[2018-05-03T23:28:36.0547 CEST] [debug] <<< testapi::select_console(testapi_console='svirt', await_console=0)
[2018-05-03T23:28:36.0628 CEST] [debug] Connection to root@s390p8.suse.de established
/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
  'port' => 56100,

which is already helpful because we know that an ssh-connection is established. However we the part about VNC is not really clear to test reviewers. So I think one debug line could be helpful. And in case the ssh connection is not working it might be hard to understand who tried to connect to where (from where) and why so I suggest a debug line there as well that say something like "[debug] select_console 'svirt': Connecting from '' to ''". The related tickets show some examples, please see them as well. For example https://openqa.suse.de/tests/1536522#step/boot_to_desktop/20 is a case of a black screen and it is not obvious from the logs what is going wrong.

Actions #14

Updated by coolo over 6 years ago

the ssh connections don't happen in backends - they happen in consoles

Actions #15

Updated by okurz over 6 years ago

  • Subject changed from [tools] Better logging and error handling in case of remote console connections in backends, e.g. ssh to [tools] Better logging and error handling in case of remote console connections in consoles or backends, e.g. ssh

Mainly yes. But at least the backend/baseclass.pm has the message bmwqemu::diag "Connection to $args{username}\@$args{hostname} established" if $ssh->auth_ok; already and at least "ssh" is mentioned a lot in backends. I thought of "backends" as the generic term for backends and consoles but sorry if this was misunderstood.

Actions #16

Updated by coolo over 6 years ago

Well, but the ssh you're having problems with is running in an xterm and is assert_screened. The failed assert_screen is the feedback

Actions #17

Updated by coolo over 6 years ago

From https://openqa.suse.de/tests/1677493/file/autoinst-log.txt

20564
[2018-05-09T11:59:16.0547 CEST] [debug] activate_console, console: root-console, type: console
[2018-05-09T11:59:16.0547 CEST] [debug] backend s390x || zkvm
[2018-05-09T11:59:16.0547 CEST] [debug] /var/lib/openqa/cache/tests/sle/tests/boot/boot_to_desktop.pm:27 called opensusebasetest::wait_boot
[2018-05-09T11:59:16.0548 CEST] [debug] <<< testapi::assert_screen(mustmatch='password-prompt', timeout=30)
[2018-05-09T11:59:16.0603 CEST] [debug] MATCH(password-prompt-20141126:0.00)
[2018-05-09T11:59:16.0616 CEST] [debug] MATCH(password-prompt-20150730:0.00)
[2018-05-09T11:59:16.0628 CEST] [info] Collected unknown process with pid 20564 and exit status: 0

But there is more in the journal:

May 09 11:58:24 openqaw2 worker[20496]: [info] Collected unknown process with pid 20318 and exit status: 1
May 09 11:58:24 openqaw2 worker[20496]: [info] Collected unknown process with pid 20320 and exit status: 84
May 09 11:58:24 openqaw2 worker[20496]: [info] Collected unknown process with pid 20322 and exit status: 13
May 09 11:58:24 openqaw2 worker[20496]: [info] Collected unknown process with pid 20325 and exit status: 0
May 09 11:58:24 openqaw2 worker[20496]: [info] Collected unknown process with pid 20326 and exit status: 255
May 09 11:59:16 openqaw2 worker[20496]: [info] Collected unknown process with pid 20564 and exit status: 0
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20562 and exit status: 1
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20292 and exit status: 0
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20441 and exit status: 1
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20535 and exit status: 0
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20446 and exit status: 84
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20455 and exit status: 0
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20443 and exit status: 84
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20456 and exit status: 255
May 09 11:59:48 openqaw2 worker[20496]: [info] Collected unknown process with pid 20401 and exit status: 0

Actions #18

Updated by coolo over 6 years ago

the 84 are xterm's exit code for IO error - so they are easy to spot :)

This happens whenever we close the xvnc console. I guess we should take care of more processes within the consoles, so we know when they die before we want them to.

Actions #19

Updated by okurz over 6 years ago

coolo wrote:

the 84 are xterm's exit code for IO error - so they are easy to spot :)

good to know

I guess we should take care of more processes within the consoles, so we know when they die before we want them to.

Yes, that would be great!

Actions #20

Updated by oorlov over 6 years ago

  • Related to deleted (action #34837: [sle12][functional][u][s390x-kvm] test fails in consoletest_finish while switching from console to GUI desktop, stuck on black screen)
Actions #21

Updated by oorlov over 6 years ago

  • Blocks action #36268: [sle][functional][u][hard][s390x] test fails in consoletest_finish - screen does not show desktop session added
Actions #22

Updated by mgriessmeier over 6 years ago

  • Status changed from Workable to In Progress
Actions #23

Updated by mgriessmeier over 6 years ago

  • Status changed from In Progress to Resolved
  • Assignee set to coolo
Actions #24

Updated by coolo over 6 years ago

  • Status changed from Resolved to Workable

this is just a hotpatch after suggestion from okurz. as it seems to work well, we should upstream that

Actions #25

Updated by coolo over 6 years ago

  • Status changed from Workable to Resolved
Actions

Also available in: Atom PDF