action #41504

closed

[s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords - "evil sleep"

Added by nicksinger over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Feature requests
Target version:
SUSE QA - Milestone 20
Start date:
2018-09-24
Due date:
2018-10-23
% Done:
0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-s390x-create_hdd_minimal_base+sdk@zkvm fails in
svirt_upload_assets because it enters the password for the svirt console too early.
We should adjust the backend here to fully rely on ssh key-based authentication. This avoids waiting and checking for password prompts.


Related issues 2 (0 open, 2 closed)

Related to openQA Tests - action #42668: [sle][functional][hyperv] test fails in bootloader_zkvm & bootloader_uefi: Implement svirt password check (Resolved, michalnowak, 2018-10-18)

Blocks openQA Tests - action #42029: [sle][functional][u] - test fails in reconnect_s390 - race condition when switching to svirt console and password prompt (Resolved, okurz, 2018-10-05, 2018-11-06)

Actions #1

Updated by zluo over 5 years ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

take over

Actions #2

Updated by zluo over 5 years ago

problem with zKVM machine:

[2018-09-25T13:10:38.0713 CEST] [debug] [autotest] process exited: 0
[2018-09-25T13:10:40.0714 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 89.

[2018-09-25T13:10:40.0719 CEST] [debug] commands process exited: 0
[2018-09-25T13:10:41.0720 CEST] [debug] isotovideo done
[2018-09-25T13:10:41.0787 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-09-25T13:10:41.0898 CEST] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2018-09-25T13:10:41.0898 CEST] [debug] BACKEND SHUTDOWN 0
[2018-09-25T13:10:41.0902 CEST] [debug] Destroying openQA-SUT-4 virtual machine
[2018-09-25T13:10:42.0181 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-09-25T13:10:42.0788 CEST] [debug] Command's stdout:
Domain openQA-SUT-4 destroyed

--

I changed worker.ini so that it could find openQA-SUT-4 at least. But after Domain openQA-SUT-4 destroyed, the test still failed:

XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":43972" after 8701 requests (8469 known processed) with 0 events remaining.

Actions #3

Updated by zluo over 5 years ago

the ideas:

Use ssh-copy-id to copy the public key from the VM to the zKVM server; this can be added before first_boot.

or

Use a kernel boot option to hand over the public key to the zKVM server.

Actions #4

Updated by okurz over 5 years ago

Maybe I misunderstand your approach but I think this is only about ssh keys between the worker hosts and s390pb and s390p8, not the VMs

Actions #5

Updated by okurz over 5 years ago

  • Category set to 132
  • Target version set to Milestone 19
Actions #6

Updated by zluo over 5 years ago

I need access to the osd worker host for ssh-copy-id, or someone else can do this for me.

Actions #7

Updated by coolo over 5 years ago

Do what for you? I would rather avoid having even more admins - we failed just yesterday with that strategy.

Actions #8

Updated by okurz over 5 years ago

As I said, every test logs into the hosts with a password, so you can do the same with the same password. But I would rather wait and discuss this idea in a somewhat bigger scope. Can you please propose in the ticket, in written form, what you want to do?

Actions #9

Updated by zluo over 5 years ago

  • Status changed from In Progress to Feedback

@okurz the basic idea is to ssh-copy-id the key from the worker host to the zKVM server. Or do you have another idea? We can discuss this offline, of course.

Actions #10

Updated by zluo over 5 years ago

@coolo I don't need access to the osd server at all. Someone can do this for me for sure. But I don't think the situation with too many admins has anything to do with this ticket.

Actions #11

Updated by zluo over 5 years ago

  • Assignee deleted (zluo)
Actions #12

Updated by okurz over 5 years ago

  • Status changed from Feedback to Workable

zluo wrote:

@okurz the basic idea is to ssh-copy-id the key from the worker host to the zKVM server. Or do you have another idea? We can discuss this offline, of course.

Sounds like the right direction but my question would be "which key?". I can think of the following approaches:

  1. Create a key for the worker dynamically, copy that once using password authentication and from then on rely only on key authentication -> problem is that keys would pile up on s390p8/b
  2. Create a static private key, store it with salt recipes, let the salt recipes put the key on the worker and manually put the public key on s390p8/b
  3. As 2 but also cover the whole key management on s390p8/b with salt

Probably we should go with 2 given that s390p8/b are not maintained by salt AFAIK unless some salt expert can hack some "put key on remote ssh server using password authentication" salt within 5-10 minutes.

Actions #13

Updated by okurz over 5 years ago

  • Blocks action #42029: [sle][functional][u] - test fails in reconnect_s390 - race condition when switching to svirt console and password prompt added
Actions #14

Updated by okurz over 5 years ago

  • Subject changed from [s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords to [s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords - "evil sleep"
Actions #15

Updated by jorauch over 5 years ago

  • Assignee set to jorauch

Taking a look

Tasks:

  • get backend to use ssh keys instead of passwd
  • place public key on worker via salt
  • place private key on LPAR manually (the keys may be the other way around)
Actions #16

Updated by okurz over 5 years ago

So I think this ticket is mainly about the FIXME: assert_screen in sshVirtsh.pm

To my current understanding the sleep 3 is a workaround for what one tried with an assert_screen within the backend code which we most likely can not do because we are not within tests. Hence the idea was to use ssh-key authentication. This should be testable with the s390x backend as we are in an early stage of tests and do not really care about even openQA but can just use isotovideo locally. My suggestion is as follows: Download vars.json from failing test and call isotovideo.

Actions #17

Updated by jorauch over 5 years ago

When adapting this in sshVirtsh.pm we should be very careful not to break other tests

Actions #18

Updated by okurz over 5 years ago

  • Due date changed from 2018-10-09 to 2018-10-23
Actions #19

Updated by jorauch over 5 years ago

  • Status changed from Workable to In Progress

I could prove that it works with a local isotovideo (on pinky.arch.suse.de) and s390p7.
Because of the other local ssh key, the key file has to be passed as a parameter, but in general it works:
[2018-10-09T13:15:09.0824 CEST] [debug] Connection to root@s390pb.suse.de established

Actions #20

Updated by jorauch over 5 years ago

Regarding the salt recipe, we should consider explicitly adding the identity file to ~/.ssh/config to avoid collisions with other keys that might be on the worker.
We should also not forget about X forwarding if something fails.
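
The idea of pinning one identity file per host in ~/.ssh/config can be sketched as follows. This is only an illustration in Python, not the salt recipe itself: the function name, the host alias, and the key path /etc/openqa/ssh/id_svirt are hypothetical. `IdentityFile` and `IdentitiesOnly` are standard OpenSSH client options; `IdentitiesOnly yes` makes ssh offer only the configured key, which is what avoids collisions with other keys on the worker.

```python
def render_ssh_host_block(host_alias, hostname, user, identity_file):
    """Return an OpenSSH client config stanza that pins one key to one host."""
    lines = [
        f"Host {host_alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
        # Offer only the key named above, not every key found in the agent
        # or in the default ~/.ssh locations.
        "    IdentitiesOnly yes",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical key path; the real location would be decided by the salt recipe.
print(render_ssh_host_block("s390p8", "s390p8.suse.de", "root",
                            "/etc/openqa/ssh/id_svirt"))
```

A stanza like this could be appended to the worker's ssh config by whatever mechanism distributes the key.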

Actions #21

Updated by okurz over 5 years ago

Alright, this sounds good so far. Do you have some changes ready for a (WIP-)PR? Regarding the addition to salt, I suggest talking to your room colleague nicksinger.

Actions #22

Updated by jorauch over 5 years ago

The problem is in sshCommand in consoles/console.pm, which defines the parameters passed to ssh and contains -o PubkeyAuthentication=no.
We could also have a problem with the baseclass in consoles/sshXtermVt.pm, which also calls an ssh terminal.
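
To make the effect of that option concrete, here is a minimal Python sketch of an ssh argument builder that switches between password and key authentication. It is not the os-autoinst sshCommand code (which is Perl); the function name and the example key path are hypothetical. The ssh options used (`-i`, `PubkeyAuthentication`, `IdentitiesOnly`, `BatchMode`) are standard OpenSSH client options.

```python
def build_ssh_argv(user, host, key_file=None):
    """Build an ssh command line for password or key-based login."""
    argv = ["ssh"]
    if key_file is None:
        # Password login: never offer a key, so a password prompt is expected.
        argv += ["-o", "PubkeyAuthentication=no"]
    else:
        # Key login: use exactly this key and never fall back to a prompt,
        # so there is no password prompt to wait for.
        argv += ["-i", key_file,
                 "-o", "IdentitiesOnly=yes",
                 "-o", "BatchMode=yes"]
    argv.append(f"{user}@{host}")
    return argv


print(build_ssh_argv("root", "s390p8.suse.de"))
print(build_ssh_argv("root", "s390p8.suse.de", key_file="/etc/openqa/ssh/id_svirt"))
```

With `BatchMode=yes`, a missing or rejected key makes ssh fail immediately instead of asking for a password, which turns a timing problem into a clear error.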

Actions #23

Updated by jorauch over 5 years ago

We now have a first dirty solution:
http://pinky.arch.suse.de/tests/1500#

This still needs a lot of cleanup and abstraction but it works

The following files and functions have been touched so far:

  • backend/baseclass: new_ssh_connection (changes should not be necessary)
  • consoles/sshVirtsh.pm: activate (removed one ssh call, did not encounter problems, recheck)
  • consoles/sshXtermVt.pm: activate (added the key as parameter for a ssh call)
  • consoles/console.pm: sshCommand (added the key parameter)

In summary, we can say that the issue was in sshXtermVt and the password was typed in sshVirtsh.

Actions #24

Updated by jorauch over 5 years ago

WIP PR:
https://github.com/os-autoinst/os-autoinst/pull/1038

Since ~ is not being expanded correctly, we should now think about where we can place the file and get started with salt and adding the key to the LPARs.
The original synchronization issue is still valid, but at least we no longer have the risk of tests failing due to race conditions.
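
One general reason a leading ~ stays unexpanded is that tilde expansion is done by the shell, so a path handed directly to a program in its argument list arrives unchanged. Whether that is the cause here is an assumption. A minimal Python sketch of resolving the key path explicitly before passing it on (the function name and example path are hypothetical):

```python
import os


def resolve_key_path(path):
    """Expand a leading ~ and return an absolute path for the key file."""
    # os.path.expanduser replaces a leading "~" with the user's home
    # directory; os.path.abspath then normalizes the result.
    return os.path.abspath(os.path.expanduser(path))


print(resolve_key_path("~/.ssh/id_rsa"))
```

The alternative discussed in the thread, placing the key at a fixed absolute location, avoids the expansion question entirely.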

Actions #25

Updated by okurz over 5 years ago

  • Target version changed from Milestone 19 to Milestone 20
Actions #26

Updated by jorauch over 5 years ago

  • Target version deleted (Milestone 20)

We could try to wait at the console level for SSH to finish, e.g. ssh root@s390p7 && echo "done"; this could block the test until the connection is up.

http://pinky.arch.suse.de/tests/1507#step/bootloader_zkvm/5

That did not work out as expected

Actions #27

Updated by okurz over 5 years ago

@jorauch I suggest to put this task on hold until you find a "pair" or "mob" to program in to collect more crazy ideas and move forward :)

Actions #28

Updated by okurz over 5 years ago

  • Target version set to Milestone 20
Actions #29

Updated by jorauch over 5 years ago

Simply including assert_screen did not work out; I would like to check whether assert_screen can be adapted.
I will ask in the daily for help and try to increase collaboration with mgriessmeier.

We should try to replace the sleep with a callback for the s390 login (like the HyperV console switch), which can be a simple assert_screen to begin with.
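
The general pattern behind replacing a fixed sleep with a callback is to poll a readiness check until it succeeds or a timeout expires. The following Python sketch shows only that pattern; it is not os-autoinst API, and the function name, parameters, and the stand-in condition are hypothetical.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns True or the timeout expires.

    Returns True as soon as the condition holds and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the password prompt is visible": true on the third check.
checks = {"count": 0}


def prompt_visible():
    checks["count"] += 1
    return checks["count"] >= 3


if wait_until(prompt_visible, timeout=5.0, interval=0.01):
    print("prompt seen, safe to type the password")
```

Compared with a fixed sleep, this continues as soon as the prompt appears and reports a timeout explicitly instead of typing into a console that is not ready.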

Actions #30

Updated by jorauch over 5 years ago

  • Status changed from In Progress to Feedback
Actions #31

Updated by jorauch over 5 years ago

  • Status changed from Feedback to In Progress

OSD did already break for zkvm:
https://openqa.suse.de/tests/2177940#step/bootloader_zkvm/3
So revisiting this

Actions #32

Updated by jorauch over 5 years ago

  • Status changed from In Progress to Feedback

According to foursixnine this is executing only half of the merges as stated on GitHub, so will keep monitoring this

We should demo this at the next review

Actions #33

Updated by okurz over 5 years ago

  • Status changed from Feedback to In Progress
  • Assignee changed from jorauch to okurz
  • Priority changed from Normal to Urgent

So we have both the os-autoinst code as well as test code deployed and broke all tests in bootloader_zkvm, e.g. https://openqa.suse.de/tests/2185917#step/bootloader_zkvm/11

I will see what I can do about it. The verification run looked fine so I wonder what changed now…

Actions #34

Updated by jorauch over 5 years ago

Any progress here?
Are there different LPAR versions that behave differently from the ones used for verification?
Are both PRs really deployed? When looking at this last time, Santi wanted to un-deploy the test changes until the next deploy of os-autoinst.
It looks to me like we are missing an xterm window; might it be the one I deleted? But locally this worked fine.
It also happens locally:
http://pinky.arch.suse.de/tests/1592#live

Actions #35

Updated by jorauch over 5 years ago

Sorry for the spam. Apparently the code I removed was not that dead for SLE12; when re-adding it, the test works fine: http://pinky.arch.suse.de/tests/1593
It also does not break sle15: http://pinky.arch.suse.de/tests/1594
A revert should fix it short term and we can investigate it further

Actions #36

Updated by jorauch over 5 years ago

  • Related to action #42668: [sle][functional][hyperv] test fails in bootloader_zkvm & bootloader_uefi: Implement svirt password check added
Actions #37

Updated by szarate over 5 years ago

@jorauch both prs should be deployed by now

Actions #38

Updated by jorauch over 5 years ago

I wonder whether mnowak's PR will interfere here or even solve our problem:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5999

Actions #39

Updated by jorauch over 5 years ago

  • Status changed from In Progress to Feedback

The PR did fix the problem:
https://openqa.suse.de/tests/2193126#

I think we can close this now?

Actions #40

Updated by okurz over 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from okurz to jorauch

Well, I guess we are not really using ssh keys in the end but still, yes, I consider ourselves done because we deleted a FIXME and sleep. Thank you for taking care!
