Project

General

Profile

Actions

action #19350

closed

[sle][functional][s390x][zkvm][hard] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606)

Added by okurz almost 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Enhancement to existing tests
Start date:
2017-05-24
Due date:
2018-01-17
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-s390x-gnome+proxy-SCC+allmodules@zkvm fails in
consoletest_setup
still showing a black screen from X11 but not a text terminal.

Reproducible

Fails since (at least) Build 0399 (current job)
first time occurence Looking at https://openqa.suse.de/tests/960939#previous, fails 5/5.

Expected result

Last good: 0396 (or more recent)

Problem

  • H1: switching too fast to text terminal when x11 systemd target has not been fully reached yet
  • H2: SUT is under load, taking too long to switch to text terminal
  • H3: REJECTED SSH-Server does not start -> SUT is not reachable -> bsc#1040606
  • H4: ACCEPTED Backend hangs in autotest:263 my $rsp = myjsonrpc::read_json($isotovideo);

Further details

Always latest result in this scenario: latest


Related issues 3 (0 open3 closed)

Blocks openQA Tests - coordination #23650: [sle][functional][ipmi][epic][u] Fix test suite gnome to work on ipmi 12-SP3 and 15 (WAS: test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?)ResolvedSLindoMansilla2017-10-20

Actions
Blocks openQA Tests - action #20022: [sle][functional][zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or diedResolvedmgriessmeier2017-06-232017-10-25

Actions
Copied to openQA Tests - action #25414: [sle][functional][s390x][zvm] improve debugging for zvm only shows black screen in reconnect_s390 and error is hidden in "expect_3270" outputResolvedokurz

Actions
Actions #1

Updated by okurz almost 7 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Retriggered, should check later: https://openqa.suse.de/tests/957462

Actions #2

Updated by okurz almost 7 years ago

  • Subject changed from [sle][functional][s390x][zkvm][check_sporadic] unable to switch to text terminal in consoletest_setup to [sle][functional][s390x][zkvm][consistent] unable to switch to text terminal in consoletest_setup
  • Description updated (diff)
  • Assignee deleted (okurz)
Actions #3

Updated by okurz almost 7 years ago

  • Subject changed from [sle][functional][s390x][zkvm][consistent] unable to switch to text terminal in consoletest_setup to [sle][functional][s390x][zkvm] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606)
  • Category changed from Bugs in existing tests to Enhancement to existing tests
  • Assignee set to mgriessmeier

As mgriessmeier and me found out the problem is actually the same as described in bsc#1040606 so it is this bug.

Still, the problem is very misleading because not many would expect the ssh daemon in the SUT to be not working when we can not switch to a text terminal within the consoletest_setup test module.

Actions #4

Updated by mgriessmeier over 6 years ago

@okurz: how to proceed here?

Actions #5

Updated by okurz over 6 years ago

  • Due date set to 2017-09-27

added to current sprint backlog

Actions #6

Updated by mgriessmeier over 6 years ago

  • Description updated (diff)
Actions #7

Updated by mgriessmeier over 6 years ago

  • Copied to action #25414: [sle][functional][s390x][zvm] improve debugging for zvm only shows black screen in reconnect_s390 and error is hidden in "expect_3270" output added
Actions #8

Updated by mgriessmeier over 6 years ago

Suggestions:

  • retrigger https://openqa.suse.de/tests/1174158 to reproduce "considering VNC stalled, no update for 4.25 seconds" messages in log
  • modify redefine_svirt_domain.pm to disable_vnc_stalls on the svirt console and see if it improves the log-output

  • modify wait_serial to look on multiple messages to provide more specific exit-criterias

Actions #9

Updated by okurz over 6 years ago

  • Blocks coordination #23650: [sle][functional][ipmi][epic][u] Fix test suite gnome to work on ipmi 12-SP3 and 15 (WAS: test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?) added
Actions #10

Updated by mgriessmeier over 6 years ago

  • Status changed from In Progress to Feedback

helpful PRs to get more information about the incompletes were merged meanwhile:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3604
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3600

no obvious incompletes spotted in the last few builds, so lowering priority and setting to feedback

Actions #11

Updated by okurz over 6 years ago

  • Target version set to Milestone 11
Actions #12

Updated by okurz over 6 years ago

We did find another instance: https://openqa.suse.de/tests/1179206

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3612 prepared to fix that.

I'm not quite sure if maybe this will break special scenarios like ssh-X and such but maybe we should just try and then see about that.

Actions #13

Updated by okurz over 6 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from mgriessmeier to okurz

ok so after the latest PR merged we have retriggered a whole medium again for SLE15 tests and couldn't find any related problems but also some of the tests still don't ever progress that far. Also I ran "create_hdd_minimal_base+sdk@zkvm" 20 times without any problems. So I cloned sle-12-SP3-Server-DVD-s390x-Build0473-gnome+proxy_SCC+allmodules@zkvm now to check for further problems if any.

EDIT: Well that experiment failed early with "nothing provides product(SLES) = 12 needed by product:sle-module-adv-systems-management-12-0.s390x" so either all the modules on SCC are now broken or I did something wrong.

Actions #14

Updated by okurz over 6 years ago

  • Status changed from In Progress to Resolved

ok, when taking a look at the latest SLE15 build we don't find any incompletes. I assume we improved here but can reopen when we see same symptoms.

Actions #15

Updated by okurz over 6 years ago

  • Blocks action #20022: [sle][functional][zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died added
Actions #16

Updated by okurz over 6 years ago

  • Subject changed from [sle][functional][s390x][zkvm] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606) to [sle][functional][s390x][zkvm][hard] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606)
  • Due date deleted (2017-09-27)
  • Status changed from Resolved to In Progress
  • Assignee deleted (okurz)
  • Target version deleted (Milestone 11)

Seems we are not done here. See https://openqa.suse.de/tests/1269751#step/boot_to_desktop/19

Here we try to select the root-console, i.e. connect to the SUT over SSH, which is reproducibly not working in 2/2 jobs after retriggering in this scenario.

Actions #17

Updated by okurz over 6 years ago

  • Assignee set to okurz

The one working in before is already 14d old so I retriggered an older one to test if that still works or if something in the build caused it:

$ openqa_clone_job_osd --skip-chained-deps 1251541 TEST=okurz_reproduce_reconnect_problem _GROUP=0

https://openqa.suse.de/tests/1269791

died as incomplete, I don't understand the error: https://openqa.suse.de/tests/1269791/file/autoinst-log.txt :(

Actions #18

Updated by okurz over 6 years ago

  • Due date set to 2017-12-19
  • Assignee deleted (okurz)
  • Target version set to Milestone 12

Seems we have to revisit this beast…

Actions #19

Updated by okurz over 6 years ago

  • Due date changed from 2017-12-19 to 2018-01-17
Actions #20

Updated by mgriessmeier over 6 years ago

  • Assignee set to mgriessmeier
Actions #21

Updated by mgriessmeier over 6 years ago

  • Description updated (diff)

I've debugged this issue further together with nick:

Still sporadic, though really good reproducible on local instance:

  1. Successful run:
    $self->run_cmd("! virsh dominfo $vmname | grep -w 'shut off'");
    gets executed multiple times until the return code = 1

  2. Incompletes
    $self->run_cmd("! virsh dominfo $vmname | grep -w 'shut off'");
    gets executed once, return code is 0, which is fine, because the machine isn't shut down yet

After this it hangs, see Debug output:

[2018-01-03T15:06:11.0114 CET] [debug] Connection to root@s390p8.suse.de established
[2018-01-03T15:06:11.0248 CET] [debug] Command executed: ! virsh dominfo openQA-SUT-7 | grep -w 'shut off', ret=0
[2018-01-03T15:06:11.0248 CET] [debug] === AFTER READ, rsp: HASH(0x66b2948)
[2018-01-03T15:06:11.0248 CET] [debug] === IS_SHUTDOWN RETURN VALUE: 0
         Stopping Restore /run/initramfs on shutdown...
         Stopping Command Scheduler...
         Stopping SuSEfirewall2 phase 2...
         Stopping User Manager for UID 0...
         Stopping Session 1 of user root.
[  OK  ] Stopped Serial Getty on ttysclp0.
[  OK  ] Stopped Serial Getty on sclp_line0.
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped OpenSSH Daemon.
[  OK  ] Stopped Command Scheduler.
[  OK  ] Stopped User Manager for UID 0.
[  OK  ] Stopped Restore /run/initramfs on shutdown.
[  OK  ] Stopped Session 1 of user root.
[  OK  ] Removed slice User Slice of root.
         Stopping Login Service...
         Stopping Postfix Mail Transport Agent...
         Stopping Permit User Sessions...
[  OK  ] Stopped /etc/init.d/boot.local Compatibility.
[  OK  ] Removed slice system-serial\x2dgetty.slice.
[  OK  ] Stopped Login Service.
[  OK  ] Stopped Login and scanning of iSCSI devices.
[  OK  ] Stopped target Network is Online.
[  OK  ] Stopped Load kdump kernel and initrd.
[  OK  ] Started Show Plymouth Power Off Screen.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Remote File Systems (Pre).
[  OK  ] Stopped target User and Group Name Lookups.
[  OK  ] Stopped Load kdump kernel early on startup.
[  OK  ] Stopped SuSEfirewall2 phase 2.
[  OK  ] Stopped Postfix Mail Transport Agent.
[  OK  ] Stopped target Network.
         Stopping wicked managed network interfaces...
[  OK  ] Stopped target Host and Network Name Lookups.
         Stopping Name Service Cache Daemon...
[  OK  ] Stopped Name Service Cache Daemon.
[  OK  ] Stopped wicked managed network interfaces.
         Stopping wicked network nanny service...
[  OK  ] Stopped wicked network nanny service.
         Stopping wicked network management service daemon...
[  OK  ] Stopped wicked network management service daemon.
         Stopping wicked DHCPv4 supplicant service...
         Stopping wicked AutoIPv4 supplicant service...
         Stopping wicked DHCPv6 supplicant service...
[  OK  ] Stopped wicked DHCPv6 supplicant service.
[  OK  ] Stopped wicked AutoIPv4 supplicant service.
[  OK  ] Stopped wicked DHCPv4 supplicant service.
[  OK  ] Stopped SuSEfirewall2 phase 1.
         Stopping D-Bus System Message Bus...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed Open-iSCSI iscsid Socket.
[  OK  ] Closed PC/SC Smart Card Daemon Activation Socket.
[  OK  ] Closed Syslog Socket.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Slices.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
         Stopping Load/Save Random Seed...
[  OK  ] Stopped target Swap.
         Deactivating swap /dev/disk/by-path/ccw-0.0.0000-part2...
[  OK  ] Stopped Update is Completed.
[  OK  ] Stopped Rebuild Journal Catalog.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Load Kernel Modules.
         Stopping Update UTMP about System Boot/Shutdown...
[  OK  ] Stopped target Encrypted Volumes.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped Rebuild Hardware Database.
         Stopping LSB: Set Control Program Identification data...
[  OK  ] Deactivated swap /dev/disk/by-uuid/...b3c-d26c-443a-9ba2-964b7360e65d.
[  OK  ] Deactivated swap /dev/vda2.
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Deactivated swap /dev/disk/by-path/ccw-0.0.0000-part2.
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped Flush Journal to Persistent Storage.
[  OK  ] Stopped target Local File Systems.
         Unmounting /opt...
         Unmounting /var/tmp...
         Unmounting /var/lib/pgsql...
         Unmounting /var/lib/mariadb...
         Unmounting /var/lib/mailman...
         Unmounting /srv...
         Unmounting /var/cache...
         Unmounting /var/lib/libvirt/images...
         Unmounting /var/crash...
         Unmounting /usr/local...
         Unmounting /var/lib/machines...
         Unmounting /var/lib/mysql...
         Unmounting /run/user/0...
         Unmounting /.snapshots...
         Unmounting /var/spool...
         Unmounting /var/lib/named...
         Unmounting /home...
         Unmounting /tmp...
         Unmounting /boot/zipl...
         Unmounting /var/opt...
         Unmounting /boot/grub2/s390x-emu...
         Unmounting /var/log...
[  OK  ] Unmounted /opt.
[  OK  ] Unmounted /var/tmp.
[  OK  ] Unmounted /var/lib/pgsql.
[  OK  ] Unmounted /var/lib/mariadb.
[  OK  ] Unmounted /var/lib/mailman.
[  OK  ] Unmounted /srv.
[  OK  ] Unmounted /var/cache.
[  OK  ] Unmounted /var/lib/libvirt/images.
[  OK  ] Unmounted /var/crash.
[  OK  ] Unmounted /usr/local.
[  OK  ] Unmounted /var/lib/machines.
[  OK  ] Unmounted /var/lib/mysql.
[  OK  ] Unmounted /run/user/0.
[  OK  ] Unmounted /.snapshots.
[  OK  ] Unmounted /var/spool.
[  OK  ] Unmounted /var/lib/named.
[  OK  ] Unmounted /home.
[  OK  ] Unmounted /tmp.
[  OK  ] Unmounted /var/opt.
[  OK  ] Unmounted /boot/zipl.
[  OK  ] Unmounted /boot/grub2/s390x-emu.
[  OK  ] Unmounted /var/log.
[  OK  ] Stopped File System Check on /dev/disk/by-path/ccw-0.0.0000-part1.
[  OK  ] Removed slice system-systemd\x2dfsck.slice.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Stopped target Local File Systems (Pre).
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Stopped LSB: Set Control Program Identification data.
[  OK  ] Reached target Shutdown.

[2018-01-03T15:06:12.0249 CET] [debug] === ITERATION ASSERT_SHUTDOWN with remaining timeout 59
Use of uninitialized value $args in concatenation (.) or string at /home/os-autoinst/autotest.pm line 250.
[2018-01-03T15:06:12.0249 CET] [debug] === QUERY_ISOTOVIDEO CALLED, CMD: backend_is_shutdown, ARGS: 
[2018-01-03T15:06:12.0249 CET] [debug] === BEFORE SEND, CMD: backend_is_shutdown, JSON: backend_is_shutdown
[2018-01-03T15:06:12.0249 CET] [debug] === BEFORE READ, isotovideo: GLOB(0x1881938)

so, one can see that the issue is not in the testcode, since the while loop in assert_shutdown is iterated again, but then the backend hangs in query_isotovideo('backend_is_shutdown')

Actions #22

Updated by mgriessmeier over 6 years ago

applied mnowaks PR https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4107 and triggered 20 jobs on o.s.d to gather some statistics

Actions #23

Updated by mgriessmeier over 6 years ago

  • Status changed from In Progress to Feedback

18/18 runs successful - no incompletes with PR https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4107
setting to feedback until PR is merged

Actions #24

Updated by mgriessmeier over 6 years ago

PR got merged and deployed
https://openqa.suse.de/tests/1371294 as verification run

Actions #25

Updated by mgriessmeier over 6 years ago

  • Status changed from Feedback to Resolved

https://openqa.suse.de/tests/1371296 successful on o.s.d

I know that this issue was sporadic and one verification run is probably not sufficient, but taking the other ~5 tests i did manually into account, I consider this as resolved
If you see any incompletes on s390-kvm in shutdown again, please reopen

Actions #26

Updated by okurz over 6 years ago

thank you, good job!

Actions

Also available in: Atom PDF