action #23814
[sle][functional][hard][sle15]remote_ssh_controller fails to connect to the client via ssh
Status: Closed
% Done: 100%
Description
Observation
openQA test in scenario sle-15-Leanos-DVD-x86_64-remote_ssh_controller@64bit fails in boot_to_desktop
Reproducible
Fails every time for SLE15
Expected result
Last good: The corresponding reference jobs from SLE 12 SP3 GM are remote_ssh_controller and remote_ssh_target_ftp
Further details
Always latest result in this scenario: latest
Updated by okurz over 7 years ago
- Subject changed from test module boots into SLE12 SP1 although it is supposed to be a test for SLE15 to [sle][functional][sle15]remote_ssh_target_ftp fails to wait for remote controller connection (was: test module boots into SLE12 SP1 although it is supposed to be a test for SLE15)
It seems you jumped to the wrong conclusion. I suggest always comparing against the last working run, which in this case is the corresponding SLE 12 SP3 test case. The test boots an existing installation (the installed version does not really matter), then connects to the client over the network to establish an ssh connection. I labeled the client scenario with the same progress issue and updated the subject line.
Updated by Anonymous over 7 years ago
The connection to the client via ssh couldn't be established. I'm investigating the reason for that.
Updated by Anonymous over 7 years ago
- Subject changed from [sle][functional][sle15]remote_ssh_target_ftp fails to wait for remote controller connection (was: test module boots into SLE12 SP1 although it is supposed to be a test for SLE15) to [sle][functional][sle15]remote_ssh_controller fails to wait for remote controller connection
Updated by Anonymous over 7 years ago
Olli, you mentioned that you labeled the client scenario with the same progress issue and updated the subject line. Which should be the client scenario?
Updated by Anonymous over 7 years ago
- Subject changed from [sle][functional][sle15]remote_ssh_controller fails to wait for remote controller connection to [sle][functional][sle15]remote_ssh_controller fails to connect to the client via ssh
Updated by Anonymous over 7 years ago
- Category deleted (Bugs in existing tests)
Scenarios of installation via ssh or vnc: remote_ssh_controller, remote_ssh_target_ftp, remote_vnc_controller, remote_vnc_target_nfs all failed because the client didn't boot properly.
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Feedback
I'll test it with a physical machine next week and update the ticket on whether it is a bug in our tests or a product bug.
Updated by okurz over 7 years ago
- Description updated (diff)
- Category set to Bugs in existing tests
yi wrote:
Olli, you mentioned that you labeled the client scenario with the same progress issue and updated the subject line. Which should be the client scenario?
Sorry, I meant "child scenario".
The corresponding reference jobs from SLE 12 SP3 GM are remote_ssh_controller and remote_ssh_target_ftp, updated description.
Comparing the evaluated test variables from the corresponding scenarios for the latest failed SLE15 vs. the last good SLE 12 SP3 with diff -Naur <(curl -s https://openqa.suse.de/tests/1171122/file/vars.json) <(curl -s https://openqa.suse.de/tests/1058714/file/vars.json)
I found the following important differences:
+ "NETBOOT" : "1",
- "REMOTE_CONTROLLER" : "vnc",
+ "REMOTE_CONTROLLER" : "ssh",
So the controller job for SLE15 seems to be missing the variable NETBOOT, which has an impact on the test flow, and it does not specify REMOTE_CONTROLLER=ssh.
Previously the variable "NETBOOT" was specified for the Server-MINI-ISO, which does not apply to SLE15, so we need to evaluate whether this variable is needed on the test suites instead.
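The comparison of the two vars.json files can also be done key by key, which avoids noise from ordering or formatting. The following is a minimal Python sketch, not part of the test code. The helper name diff_vars is made up for illustration, and the two dictionaries are reduced to the variables named in the diff above.

```python
import json


def diff_vars(failed: dict, good: dict) -> dict:
    """Compare two vars.json dictionaries key by key."""
    # Keys present only in the last good job
    missing = {k: good[k] for k in good.keys() - failed.keys()}
    # Keys present only in the failed job
    extra = {k: failed[k] for k in failed.keys() - good.keys()}
    # Keys present in both jobs but with different values: (failed, good)
    changed = {
        k: (failed[k], good[k])
        for k in failed.keys() & good.keys()
        if failed[k] != good[k]
    }
    return {"missing": missing, "extra": extra, "changed": changed}


# Excerpts only; real input would be the full vars.json of each job.
sle15 = json.loads('{"REMOTE_CONTROLLER": "vnc"}')
sle12sp3 = json.loads('{"NETBOOT": "1", "REMOTE_CONTROLLER": "ssh"}')

result = diff_vars(sle15, sle12sp3)
print(result["missing"])   # {'NETBOOT': '1'}
print(result["changed"])   # {'REMOTE_CONTROLLER': ('vnc', 'ssh')}
```

With the full files loaded from disk, the "missing" entry is the quickest place to spot a variable such as NETBOOT that was dropped from the newer scenario.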
About REMOTE_CONTROLLER=ssh: I have now found out that this was set by you, yi, when triggering the job manually :-) So that was a bit misleading to me. One of the last jobs with REMOTE_CONTROLLER=ssh was https://openqa.suse.de/tests/1167615#step/remote_controller/32 which really looks like it should work; at least it is doing the right step, trying to connect to the other machine with an ssh call. For better debugging I suggest making the ssh connection verbose, e.g. call ssh -vvvv instead of plain ssh in tests/remote/remote_controller.pm, and giving it a bit more time in the following "assert_screen". Maybe we should replace the whole section:
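# First attempt: plain ssh
type_string "ssh root\@$target_ip\n";
# If the login prompt does not appear, retry with verbose output
if (!check_screen('remote-ssh-login')) {
    type_string "ssh -vvvv root\@$target_ip\n";
    # Allow up to 600 seconds for the prompt on the verbose retry
    assert_screen('remote-ssh-login', 600);
}
type_string "yes\n";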
And also I suggest to
1) try it out manually first
2) crosscheck if the test still works for SLE 12 SP3.
Updated by okurz over 7 years ago
zluo has a running multimachine test environment so please try to clone and run these tests locally
Updated by zluo over 7 years ago
cloned the job: http://e13.suse.de/tests/4205#
The issue is on the support server side: first_boot failed.
Updated by zluo over 7 years ago
support server status
http://e13.suse.de/tests/4204
Updated by mgriessmeier over 7 years ago
It looks like the target cannot be installed: apparently there was a code change and it now looks for the wrong needles.
https://openqa.suse.de/tests/1180149# compared to SP3 https://openqa.suse.de/tests/1058714#
Updated by Anonymous over 7 years ago
Up to Build260.4 the test remote_ssh_controller could finish boot_to_desktop and reach the remote_controller test module:
http://openqa.suse.de/tests/1173562
However the cloned test of this job also fails at boot_to_desktop:
http://f146.suse.de/tests/1212
Updated by Anonymous over 7 years ago
The needle 'linux-login' was removed from the matching list, thus boot_to_desktop failed. I'll test with the needle added in the list.
Updated by Anonymous over 7 years ago
- Status changed from Feedback to In Progress
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Feedback
Installation via FTP doesn't work from my test machine. http://f146.suse.de/tests/1231
Clicking OK doesn't help either; the error message pops up again.
Updated by Anonymous over 7 years ago
I created PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3628
Maybe someone else can continue testing with it.
Updated by okurz over 7 years ago
- Due date changed from 2017-09-27 to 2017-10-11
should be completable in next sprint
Updated by Anonymous about 7 years ago
With PR adapted and needles adjusted/created, the test can proceed on zluo's test machine:
http://e13.suse.de/tests/4266
http://e13.suse.de/tests/4265
Both have passed.
Updated by zluo about 7 years ago
needles PR: https://gitlab.suse.de/openqa/os-autoinst-needles-sles/merge_requests/503
Please merge, thanks!
Updated by Anonymous about 7 years ago
Please merge PR so we can close the ticket. Thanks.
Updated by Anonymous about 7 years ago
- Status changed from Feedback to In Progress
Updated by okurz about 7 years ago
The needles MR is merged but I don't want to merge the test PR. Maybe just a variable like "TEXTMODE=1" is missing on the scenario or something but it's not the right way to just accept the text login prompt.
Please evaluate whether the problem is SLE15 specific or whether it also no longer works for SLE12.
Updated by okurz about 7 years ago
- Target version changed from Milestone 12 to Milestone 11
moving to M11 as it's in sprint1 now
Updated by Anonymous about 7 years ago
There is a variable VIDEOMODE=text in the settings for this scenario. I compared the settings with SLE12SP3; there are no suspicious changes.
Updated by Anonymous about 7 years ago
Olli, please take a look at the test run for SLE12SP3: http://openqa.suse.de/tests/1058714#step/boot_to_desktop/2
At this stage, it looks for exactly these three needles: emergency-mode, emergency-shell, linux-login. For SLE15 the test failed because someone removed linux-login and put display-manager in the list instead. What my PR does is restore the previous state. I don't see a reason why it shouldn't be merged.
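To illustrate why the missing tag matters: an assert_screen call with a list of tags passes as soon as one of the tags matches the current screen and fails if none does. The Python sketch below models only that rule. It is a simplification and not the os-autoinst implementation; the function name matching_tag and the contents of the simulated screen are assumptions.

```python
def matching_tag(expected_tags, tags_on_screen):
    """Return the first expected tag that matches the screen, or None."""
    for tag in expected_tags:
        if tag in tags_on_screen:
            return tag
    return None


# Assumption: the machine sits at a text login prompt, so only the
# 'linux-login' needle could match.
screen = {"linux-login"}

sle12_tags = ["emergency-mode", "emergency-shell", "linux-login"]
sle15_tags = ["emergency-mode", "emergency-shell", "display-manager"]

print(matching_tag(sle12_tags, screen))  # linux-login
print(matching_tag(sle15_tags, screen))  # None, the assertion would fail
```

Whether the right fix is to add the text tag back or to change the scenario settings is a separate question; the sketch only shows why the tag list and the actual screen have to agree.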
Updated by okurz about 7 years ago
yi wrote:
The needle 'linux-login' was removed from the matching list, thus boot_to_desktop failed. I'll test with the needle added in the list.
So I think we agreed that the needle was not removed from the matching list but the test code and/or test settings changed.
I triggered tests with explicitly set DESKTOP=textmode
$ openqa_clone_job_osd 1187284 DESKTOP=textmode TEST=okurz_poo#23814_triggered_with_videomode_text
Cloning dependencies of sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_controller@64bit
Created job #1190375: sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_target_ftp@64bit -> https://openqa.suse.de/t1190375
Created job #1190376: sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_controller@64bit -> https://openqa.suse.de/t1190376
The test now succeeds in logging in to the text session, so I conclude that we need the variable DESKTOP=textmode on the testsuite. I have now added this to the testsuite remote_ssh_controller.
Now we are back to the old problem in https://openqa.suse.de/tests/1190376#step/remote_controller/32 , the ssh connection does not succeed. So back to my suggestion in #23814#note-11
Updated by Anonymous about 7 years ago
Recent test runs (after removal of a workaround needle):
https://openqa.suse.de/tests/1193261
https://openqa.suse.de/tests/1193261
Updated by zluo about 7 years ago
We tried a couple of times to verify the issue. We fixed the needle inst-textselected locally, which was not matched, and re-ran the test, but without success.
http://e13.suse.de/tests/4323#step/remote_target/1
send_key 'ret' doesn't seem to work here; the support-server fails at stage remote-target.
Updated by zluo about 7 years ago
installation_ready is not available to remote_controller even after I added the following changes (your suggestion) in remote_controller.pm:
--
type_string "ssh root\@$target_ip\n";
assert_screen "remote-ssh-login";
if (!check_screen('remote-ssh-login')) {
type_string "ssh -vvvv root\@$target_ip\n";
assert_screen('remote-ssh-login', 600);
}
type_string "yes\n";
same issue at stage remote-target:
http://e13.suse.de/tests/4327#step/remote_target/1
I use the following command to clone the job:
sudo /usr/share/openqa/script/clone_job.pl --from http://openqa.suse.de 1193332 VNC_TYPING_LIMIT=40 DESKTOP=textmode NETBOOT=1 REMOTE_CONTROLLER=ssh
Updated by zluo about 7 years ago
I checked the needle tag again and added inst-textselected-with_colormenu to inst-textselected, because esc is needed after textmode is selected; send_key 'ret' works after this change.
However it stops at stage installation after the installation got started:
Updated by zluo about 7 years ago
Found // in the ftp path. It seems that the ftp path is wrong; otherwise the URL page for installation would not pop up.
Updated by zluo about 7 years ago
The strange thing is that // works in a browser even though it is wrong.
The problem on my local machine was that the ftp port was closed; after opening ftp port 21 it looks good:
Updated by zluo about 7 years ago
Updated by zluo about 7 years ago
- Status changed from In Progress to Resolved
osd reference test shows that this is working fine:
https://openqa.suse.de/tests/1195279
set as resolved for now.
Updated by okurz about 7 years ago
- Status changed from Resolved to In Progress
good work so far. The target fails to boot into the expected graphical session though: https://openqa.suse.de/tests/1195278#step/first_boot/6 . Certainly not the same issue as the original one but a followup. I suggest to try to fix that within this sprint within this ticket. If not possible please create another ticket accordingly and link it appropriately.
Updated by zluo about 7 years ago
https://openqa.suse.de/tests/1196678#step/remote_controller/32
It seems that the login itself is the problem for remote-controller.
Possibly a network issue on osd.
compare with my local server:
http://e13.suse.de/tests/4363#step/remote_controller/31
remote-ssh-login-20160511.json is there...
Updated by zluo about 7 years ago
https://openqa.suse.de/tests/1195278#step/first_boot/6
shows that first_boot progresses to the login prompt.
This is actually already okay. The question is why displaymanager (a GUI needle) is needed.
The issue above is a different issue and needs to be solved first.
Updated by zluo about 7 years ago
compare with http://e13.suse.de/tests/4265#step/first_boot/1 (9 days ago)
There it successfully reached first_boot in the GUI.
Updated by zluo about 7 years ago
I changed first_boot.pm for the case when it boots up in textmode (the text-login needle is required):
assert_screen [qw(displaymanager emergency-shell emergency-mode text-login)], $boot_timeout;
Fixed; see the result at: http://e13.suse.de/tests/4370#step/first_boot/1
Updated by zluo about 7 years ago
Updated by okurz about 7 years ago
- Assignee changed from zluo to okurz
The PR seems wrong to me as it (again) tries to look for "text-login" when we expect a graphical session. Maybe I was wrong in #23814#note-31 to conclude that we need to change DESKTOP=textmode as well. Maybe only VIDEOMODE=text should be changed. I triggered tests with DESKTOP=gnome:
$ openqa_clone_job_osd 1196769 _GROUP=0 DESKTOP=gnome
Cloning dependencies of sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_controller@64bit
-> Created job #1197393: sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_target_ftp@64bit -> https://openqa.suse.de/t1197393
-> Created job #1197394: sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_controller@64bit -> https://openqa.suse.de/t1197394
Updated by okurz about 7 years ago
that again failed in the controller assuming the desktop to be gnome on initial bootup. So we need to ensure that "change_desktop" is not called but the desktop of the controller is still textmode, hm.
Updated by okurz about 7 years ago
- Due date deleted (2017-10-11)
It's a more generic problem when we have two machines that we need to care about. The "controller" has to stay in DESKTOP=textmode, the "target" has to be DESKTOP=gnome. Maybe we can introduce a new test variable, e.g. TARGET_DESKTOP or adjust the test method "default_desktop" for this scenario.
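One way to picture the TARGET_DESKTOP idea is that each job resolves its desktop from its role in the two-machine scenario. This is a purely hypothetical Python sketch: TARGET_DESKTOP is only a proposal at this point, and resolve_desktop as well as the role names are assumptions made for illustration, not existing test code.

```python
def resolve_desktop(settings: dict, role: str) -> str:
    """Pick the desktop for a job depending on its role."""
    if role == "target":
        # Hypothetical variable: fall back to DESKTOP when it is unset
        return settings.get("TARGET_DESKTOP", settings["DESKTOP"])
    # The controller always keeps the plain DESKTOP value
    return settings["DESKTOP"]


settings = {"DESKTOP": "textmode", "TARGET_DESKTOP": "gnome"}
print(resolve_desktop(settings, "controller"))  # textmode
print(resolve_desktop(settings, "target"))      # gnome
```

The fallback keeps single-machine scenarios unchanged, since they would never set the extra variable.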
Updated by okurz about 7 years ago
- Assignee deleted (okurz)
sorry, not working on this one right now.
Updated by Anonymous about 7 years ago
- Status changed from In Progress to Feedback
Currently I can't run the test on my machine. Mattias tried to help but couldn't find the reason for it. He also ran the test on his machine and had the same issue: the worker couldn't access the FTP installation source.
Updated by okurz about 7 years ago
- Due date changed from 2017-11-08 to 2017-11-22
- Status changed from Feedback to In Progress
- Assignee deleted (Anonymous)
- Target version changed from Milestone 11 to Milestone 12
Should be testable by anyone that has access to a proper multimachine test environment, let's try again in the next sprint.
The availability of multimachine test environment is a common problem for all of us where we should improve.
Updated by okurz about 7 years ago
- Due date set to 2018-01-30
- Target version changed from Milestone 12 to Milestone 13
Updated by thehejik almost 7 years ago
The problem was introduced by adding more multimachine workers and connecting them together by using GRE tunnels.
When we were using only one multimachine worker the test didn't use any VLAN in ovs-switch but it worked because it was within one worker and the same ovs-switch.
Now I've set NETWORKS=fixed to remote_ssh* and remote_vnc* tests in osd which should assign VMs to the same VLAN on ovs-switch.
Updated by thehejik almost 7 years ago
- Status changed from In Progress to Resolved
- Assignee set to thehejik
- % Done changed from 0 to 100
We introduced 2 new variables to remote_ssh test: NETWORKS=fixed, DESKTOP=textmode.
There are three other problems with remote_{ssh,vnc} tests:
- I will set MTU=1458 for VMs so that their traffic can pass through the GRE tunnel
- Martin Loviska is working on a fix for the first_boot issue in remote_ssh. It probably doesn't work because the grub2 countdown was removed in SLE15. Already done.
- Martin Kravec will fix a problem with select_console "x11" in remote_vnc. The problem is probably that the controller image is SLE12SP1 but the test is set to VERSION=15, so the console switching doesn't work properly: it sends alt-f2 instead of the alt-f7 needed for SLE12.
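The value 1458 is consistent with a standard 1500-byte MTU minus the encapsulation overhead of Ethernet frames carried over a GRE tunnel. The breakdown below is an assumption about how the tunnel encapsulates traffic (outer IPv4 header, GRE header with key field, inner Ethernet header); it is not stated in this ticket.

```python
PHYSICAL_MTU = 1500  # assumed MTU of the link between the workers

# Assumed per-packet overhead of Ethernet-over-GRE with a key field
OVERHEAD_BYTES = {
    "outer IPv4 header": 20,
    "GRE header (4 base + 4 key)": 8,
    "inner Ethernet header": 14,
}


def tunnel_mtu(physical_mtu: int, overhead: dict) -> int:
    """MTU left for the VM after subtracting the tunnel overhead."""
    return physical_mtu - sum(overhead.values())


print(sum(OVERHEAD_BYTES.values()))              # 42
print(tunnel_mtu(PHYSICAL_MTU, OVERHEAD_BYTES))  # 1458
```

If the VMs kept the default MTU of 1500, full-size packets would exceed what the tunnel can carry and would have to be fragmented or dropped, which is why both machines need the reduced value.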
Updated by thehejik almost 7 years ago
- Status changed from Resolved to In Progress
- % Done changed from 100 to 50
Updated by thehejik almost 7 years ago
- Assignee changed from thehejik to mloviska
- use MTU=1458 for controller and for target as well.
- Ideally switch to regular support server for controller and use dhcp,dns services - replace static IP configuration with DHCP
- https://progress.opensuse.org/issues/30228
Updated by okurz almost 7 years ago
- Due date changed from 2018-01-30 to 2018-02-13
- Target version changed from Milestone 13 to Milestone 14
Updated by okurz almost 7 years ago
- Subject changed from [sle][functional][sle15]remote_ssh_controller fails to connect to the client via ssh to [sle][functional][hard][sle15]remote_ssh_controller fails to connect to the client via ssh
Updated by okurz almost 7 years ago
In between we had SLE15 ssh tests that worked but now we have an incomplete: https://openqa.suse.de/tests/1440078
Stuck in
[2018-02-01T00:01:46.0202 CET] [debug] ||| starting remote_controller tests/remote/remote_controller.pm
[2018-02-01T00:01:46.0228 CET] [debug] mutex lock 'installation_ready'
[2018-02-01T00:01:46.0258 CET] [debug] mutex lock 'installation_ready' unavailable, sleeping 5s
I recommend you check this again :) Could be a race condition?
Updated by mkravec almost 7 years ago
Parent job timed out after 2 hours waiting for child.
Child job says: "State: cancelled finished about 9 hours ago ( 0 )" - seems it never started because there was no free worker (I'm guessing here)
Both were restarted and passed fine. Snippet from autoinst-log.txt is normal behavior of locked mutex.
I think this issue is resolved, wdyt?
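The log lines quoted above show a plain polling loop: the controller tries to take 'installation_ready' and, while the lock is unavailable, sleeps 5 s and retries. A minimal Python sketch of that pattern follows. It is not the openQA implementation; wait_for_mutex, try_lock, and the timeout handling are assumptions, with the 7200 s default only mirroring the two hours mentioned here.

```python
import time


def wait_for_mutex(try_lock, interval=5, timeout=7200, sleep=time.sleep):
    """Poll try_lock() until it succeeds; give up after `timeout` seconds."""
    waited = 0
    while not try_lock():
        if waited >= timeout:
            return False  # the other job never created the mutex
        sleep(interval)
        waited += interval
    return True


# Simulated child job that makes the mutex available on the third poll.
attempts = iter([False, False, True])
naps = []  # records the sleep calls instead of really sleeping
got_lock = wait_for_mutex(lambda: next(attempts), sleep=naps.append)
print(got_lock, naps)  # True [5, 5]
```

If the child job is cancelled before it starts, try_lock never succeeds and the parent keeps polling until its own time limit, which matches the incomplete job described above.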
Updated by mloviska almost 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100