action #41504
closed
[s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords - "evil sleep"
Description
Observation
openQA test in scenario sle-15-SP1-Installer-DVD-s390x-create_hdd_minimal_base+sdk@zkvm fails in svirt_upload_assets because it enters the password for the svirt console too early.
We should adjust the backend here to fully rely on ssh key-based authentication. This avoids waiting and checking for password prompts.
Updated by zluo about 6 years ago
- Status changed from Workable to In Progress
- Assignee set to zluo
take over
Updated by zluo about 6 years ago
problem with zKVM machine:
[2018-09-25T13:10:38.0713 CEST] [debug] [autotest] process exited: 0
[2018-09-25T13:10:40.0714 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 89.
[2018-09-25T13:10:40.0719 CEST] [debug] commands process exited: 0
[2018-09-25T13:10:41.0720 CEST] [debug] isotovideo done
[2018-09-25T13:10:41.0787 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-09-25T13:10:41.0898 CEST] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2018-09-25T13:10:41.0898 CEST] [debug] BACKEND SHUTDOWN 0
[2018-09-25T13:10:41.0902 CEST] [debug] Destroying openQA-SUT-4 virtual machine
[2018-09-25T13:10:42.0181 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-09-25T13:10:42.0788 CEST] [debug] Command's stdout:
Domain openQA-SUT-4 destroyed
--
I changed worker.ini so that it could at least find openQA-SUT-4. But after "Domain openQA-SUT-4 destroyed", the test still failed:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":43972" after 8701 requests (8469 known processed) with 0 events remaining.
Updated by zluo about 6 years ago
The ideas:
- Use ssh-copy-id to copy the public key from the VM to the zKVM server; this can be added before first_boot.
- Or use a kernel boot option to hand the public key over to the zKVM server.
Updated by okurz about 6 years ago
Maybe I misunderstand your approach but I think this is only about ssh keys between the worker hosts and s390pb and s390p8, not the VMs
Updated by okurz about 6 years ago
- Category set to 132
- Target version set to Milestone 19
Updated by zluo about 6 years ago
I need access to an OSD worker host to run ssh-copy-id, or someone else can do this for me.
Updated by coolo about 6 years ago
Do what for you? I would rather avoid having even more admins; we failed just yesterday with that strategy.
Updated by okurz about 6 years ago
As I said, every test logs into the hosts with a password, so you can do the same with the same password. But I would rather wait and discuss this idea in a somewhat bigger scope. Can you please propose in writing, in the ticket, what you want to do?
Updated by zluo about 6 years ago
- Status changed from In Progress to Feedback
@okurz the basic idea is to ssh-copy-id a key from the worker host to the zKVM server. Or do you have another idea? We can discuss this offline, of course.
Updated by zluo about 6 years ago
@coolo I don't need access to the OSD server at all. Someone can surely do this for me. But I don't think the situation with too many admins has anything to do with this ticket.
Updated by okurz about 6 years ago
- Status changed from Feedback to Workable
zluo wrote:
@okurz the basic idea is to ssh-copy-id a key from the worker host to the zKVM server. Or do you have another idea? We can discuss this offline, of course.
Sounds like the right direction but my question would be "which key?". I can think of the following approaches:
1. Create a key for the worker dynamically, copy it once using password authentication, and from then on rely only on key authentication. The problem is that keys would pile up on s390p8/b.
2. Create a static private key, store it with the salt recipes, let the salt recipes put the key on the worker, and manually put the public key on s390p8/b.
3. As 2, but also cover the whole key management on s390p8/b with salt.
Probably we should go with 2 given that s390p8/b are not maintained by salt AFAIK unless some salt expert can hack some "put key on remote ssh server using password authentication" salt within 5-10 minutes.
Updated by okurz about 6 years ago
- Blocks action #42029: [sle][functional][u] - test fails in reconnect_s390 - race condition when switching to svirt console and password prompt added
Updated by okurz about 6 years ago
- Subject changed from [s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords to [s390x][functional][u] test fails in svirt_upload_assets - adjust backend to use keys instead of passwords - "evil sleep"
Updated by jorauch about 6 years ago
- Assignee set to jorauch
Taking a look
Tasks:
- get backend to use ssh keys instead of passwd
- place public key on worker via salt
- place private key on LPAR manually (the keys may be the other way around)
Updated by okurz about 6 years ago
So I think this ticket is mainly about the FIXME: assert_screen in sshVirtsh.pm. To my current understanding the sleep 3 is a workaround for what one tried with an assert_screen within the backend code which we most likely can not do because we are not within tests. Hence the idea was to use ssh-key authentication. This should be testable with the s390x backend as we are in an early stage of tests and do not really care about even openQA but can just use isotovideo locally. My suggestion is as follows: Download vars.json from failing test and call isotovideo.
Updated by jorauch about 6 years ago
When adapting this in sshVirtsh.pm we should be very careful not to break other tests
Updated by okurz about 6 years ago
- Due date changed from 2018-10-09 to 2018-10-23
Updated by jorauch about 6 years ago
- Status changed from Workable to In Progress
I could prove that it works with local (on pinky.arch.suse.de) isotovideo and s390p7
Due to the other local ssh key the file has to be passed as parameter, but in general it works
[2018-10-09T13:15:09.0824 CEST] [debug] Connection to root@s390pb.suse.de established
Updated by jorauch about 6 years ago
Regarding the salt recipe, we should consider explicitly adding the identity file in ~/.ssh/config to avoid collisions with other keys that might be on the worker
We should also not forget about X Forwarding if something fails
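The suggestion to pin the identity file in ~/.ssh/config could look roughly like the following sketch. It only renders the text of a Host block; the function name render_host_block and the key path are hypothetical, while Host, User, IdentityFile and IdentitiesOnly are standard ssh_config keywords. How the salt recipe would actually deploy such a block is not covered here.

```python
def render_host_block(host: str, identity_file: str, user: str = "root") -> str:
    """Return an ssh_config Host block that pins a single identity file.

    Hypothetical helper for illustration; it is not part of os-autoinst or salt.
    """
    lines = [
        f"Host {host}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
        # Offer only the configured key, not every key found in the agent
        # or in ~/.ssh, which avoids collisions with other keys on the worker.
        "    IdentitiesOnly yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host name taken from the ticket; the key path is a made-up example.
    print(render_host_block("s390p8.suse.de", "~/.ssh/id_openqa_svirt"), end="")
```

IdentitiesOnly is what prevents ssh from trying unrelated keys first, which matters on a worker that already carries other keys.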
Updated by okurz about 6 years ago
alright, this sounds good so far. Do you have some changes ready for a (WIP-)PR? Regarding the addition to salt I suggest to talk to your room colleague nicksinger
Updated by jorauch about 6 years ago
The problem is in consoles/console.pm sshCommand which defines the parameters passed to ssh and contains -o PubkeyAuthentication=no
We could also have a problem with the baseclass in consoles/sshXtermVt.pm which also calls an ssh terminal
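To illustrate the point about the ssh parameters, here is a minimal sketch of a command builder that only disables public-key authentication when no key is given. It is written in Python for illustration and is not the actual sshCommand code from consoles/console.pm; the function name ssh_command and its parameters are assumptions. The ssh options used (-i, PubkeyAuthentication, IdentitiesOnly, BatchMode) are standard OpenSSH options.

```python
from typing import List, Optional


def ssh_command(host: str, user: str = "root",
                identity_file: Optional[str] = None) -> List[str]:
    """Build an ssh argument list (hypothetical helper, not the os-autoinst API)."""
    args = ["ssh"]
    if identity_file is None:
        # Password login: public-key authentication is switched off,
        # which is the behaviour described in the comment above.
        args += ["-o", "PubkeyAuthentication=no"]
    else:
        # Key login: pass the key explicitly, offer only that key, and
        # disable interactive prompts so no password prompt can appear.
        args += ["-i", identity_file,
                 "-o", "IdentitiesOnly=yes",
                 "-o", "BatchMode=yes"]
    args.append(f"{user}@{host}")
    return args


if __name__ == "__main__":
    print(ssh_command("s390p8.suse.de"))
    print(ssh_command("s390p8.suse.de", identity_file="/path/to/key"))
```

With BatchMode=yes a missing or rejected key makes ssh fail immediately instead of waiting at a password prompt, so there is nothing left to race against.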
Updated by jorauch about 6 years ago
We now have a first dirty solution:
http://pinky.arch.suse.de/tests/1500#
This still needs a lot of cleanup and abstraction but it works
Following files and functions have been touched so far:
- backend/baseclass: new_ssh_connection (changes should not be necessary)
- consoles/sshVirtsh.pm: activate (removed one ssh call, did not encounter problems, recheck)
- consoles/sshXtermVt.pm: activate (added the key as parameter for a ssh call)
- consoles/console.pm: sshCommand (added the key parameter)
In summary we can say that the issue was in sshXtermVt and the password was typed in sshVirtsh
Updated by jorauch about 6 years ago
WIP PR:
https://github.com/os-autoinst/os-autoinst/pull/1038
Since ~ is not being expanded correctly we should now think about where we can place the file and get started with salt and adding the key to the LPARs.
The original synchronization issue is still valid, but at least we now do not have the risk of tests failing due to race conditions anymore
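On the ~ expansion: one common cause, assumed here and not confirmed for the PR, is that tilde expansion is a shell feature, so a path passed directly as a process argument reaches the program unchanged. A minimal Python sketch of expanding the path before handing it over; resolve_key_path is a hypothetical helper and the key path is only an example.

```python
import os
import subprocess
import sys


def resolve_key_path(path: str) -> str:
    """Expand a leading ~ and return an absolute path (hypothetical helper)."""
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    raw = "~/.ssh/id_rsa"  # example path, not taken from the ticket
    # No shell is involved here, so the child process sees the literal '~'.
    child = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", raw],
        capture_output=True, text=True, check=True,
    )
    print("as received by the child:", child.stdout.strip())
    print("after explicit expansion:", resolve_key_path(raw))
```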
Updated by okurz about 6 years ago
- Target version changed from Milestone 19 to Milestone 20
Updated by jorauch about 6 years ago
- Target version deleted (Milestone 20)
We could try to wait on the console level for SSH to finish, just like ssh root@s390p7 && echo "done"; this could block the test until the connection is up.
http://pinky.arch.suse.de/tests/1507#step/bootloader_zkvm/5
That did not work out as expected
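A different and simpler check than the one tried above is to poll the SSH port until it accepts TCP connections. The sketch below is an illustration only, with a hypothetical function name wait_for_port. It says nothing about whether the login itself has finished, so it would not replace waiting for the password prompt.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22,
                  timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again right away; only reachability matters.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Host name taken from the ticket.
    print(wait_for_port("s390p8.suse.de", 22, timeout=5.0))
```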
Updated by okurz about 6 years ago
@jorauch I suggest to put this task on hold until you find a "pair" or "mob" to program in to collect more crazy ideas and move forward :)
Updated by jorauch about 6 years ago
Simply including assert_screen did not work out, I would like to check whether assert_screen can be adapted.
Will ask in daily for help and try to increase collaboration with mgriessmeier
We should try to replace the sleep with a callback for s390 login (like HyperV consoleswitch), which can be a simple assert_screen to begin with
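The idea of replacing a fixed sleep with a callback can be sketched generically as polling a condition with a deadline. The code below is Python for illustration, not os-autoinst code; wait_until, type_password_when_prompted, read_screen and type_string are hypothetical names, and matching on the text "password:" is an assumption about what the prompt looks like.

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool],
               timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Poll ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not ready():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def type_password_when_prompted(read_screen: Callable[[], str],
                                type_string: Callable[[str], None],
                                password: str,
                                timeout: float = 30.0,
                                interval: float = 0.2) -> None:
    """Type the password only after a prompt is visible, never on a timer."""
    def prompt_visible() -> bool:
        # Assumed prompt text; a real console check might match a needle instead.
        return "password:" in read_screen().lower()

    if not wait_until(prompt_visible, timeout=timeout, interval=interval):
        raise TimeoutError("no password prompt seen within timeout")
    type_string(password + "\n")
```

Compared with a fixed sleep 3, this returns as soon as the prompt appears and fails loudly when it never does, instead of typing the password into whatever happens to be on screen.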
Updated by jorauch about 6 years ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/os-autoinst/pull/1038
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5962
Have been created and merged, now waiting for OSD to break
Updated by jorauch about 6 years ago
- Status changed from Feedback to In Progress
OSD did already break for zkvm:
https://openqa.suse.de/tests/2177940#step/bootloader_zkvm/3
So revisiting this
Updated by jorauch about 6 years ago
- Status changed from In Progress to Feedback
According to foursixnine this is executing only half of the merges as stated on GitHub, so will keep monitoring this
We should demo this at the next review
Updated by okurz about 6 years ago
- Status changed from Feedback to In Progress
- Assignee changed from jorauch to okurz
- Priority changed from Normal to Urgent
So we have both the os-autoinst code as well as test code deployed and broke all tests in bootloader_zkvm, e.g. https://openqa.suse.de/tests/2185917#step/bootloader_zkvm/11
I will see what I can do about it. The verification run looked fine so I wonder what changed now…
Updated by jorauch about 6 years ago
Any progress here?
Are there different LPAR versions that behave differently from the ones used for verification?
Are both PRs really deployed? When I looked at this last time, Santi wanted to un-deploy the test changes until the next deploy of os-autoinst
It looks to me like we are missing an xterm window; maybe it is the one I deleted? But locally this worked fine
It also happens locally:
http://pinky.arch.suse.de/tests/1592#live
Updated by jorauch about 6 years ago
Sorry for the spam. Apparently the code I removed was not that dead for SLE12; after re-adding it the test works fine: http://pinky.arch.suse.de/tests/1593
It also does not break sle15: http://pinky.arch.suse.de/tests/1594
A revert should fix it short term and we can investigate it further
Updated by jorauch about 6 years ago
- Related to action #42668: [sle][functional][hyperv] test fails in bootloader_zkvm & bootloader_uefi: Implement svirt password check added
Updated by szarate about 6 years ago
@jorauch both prs should be deployed by now
Updated by jorauch about 6 years ago
I wonder whether mnowak's PR will interfere here or even solve our problem:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5999
Updated by jorauch about 6 years ago
- Status changed from In Progress to Feedback
The PR did fix the problem:
https://openqa.suse.de/tests/2193126#
I think we can close this now?
Updated by okurz about 6 years ago
- Status changed from Feedback to Resolved
- Assignee changed from okurz to jorauch
Well, I guess we are not really using ssh keys in the end but still, yes, I consider ourselves done because we deleted a FIXME and sleep. Thank you for taking care!