action #23814

closed

[sle][functional][hard][sle15]remote_ssh_controller fails to connect to the client via ssh

Added by Anonymous over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Bugs in existing tests
Start date:
2017-08-31
Due date:
2018-02-13
% Done:

100%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-15-Leanos-DVD-x86_64-remote_ssh_controller@64bit fails in
boot_to_desktop

Reproducible

Fails every time for SLE15

Expected result

Last good: The corresponding reference jobs from SLE 12 SP3 GM are remote_ssh_controller and remote_ssh_target_ftp

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz over 6 years ago

  • Subject changed from test module boots into SLE12 SP1 although it is supposed to be a test for SLE15 to [sle][functional][sle15]remote_ssh_target_ftp fails to wait for remote controller connection (was: test module boots into SLE12 SP1 although it is supposed to be a test for SLE15)

Seems like you jumped to the wrong conclusion. I suggest always comparing against the last working run; in this case that would be the corresponding SLE 12 SP3 test case. The test boots an existing installation (the installed version does not really matter), then connects to the client over the network to establish an ssh connection. I labeled the client scenario with the same progress issue and updated the subject line.

Actions #2

Updated by okurz over 6 years ago

  • Due date set to 2017-09-27

added to sprint backlog

Actions #3

Updated by Anonymous over 6 years ago

The connection to the client via ssh couldn't be established. I'm investigating the reason for that.

Actions #4

Updated by Anonymous over 6 years ago

  • Subject changed from [sle][functional][sle15]remote_ssh_target_ftp fails to wait for remote controller connection (was: test module boots into SLE12 SP1 although it is supposed to be a test for SLE15) to [sle][functional][sle15]remote_ssh_controller fails to wait for remote controller connection
Actions #5

Updated by Anonymous over 6 years ago

Olli, you mentioned that you labeled the client scenario with the same progress issue and updated the subject line. Which should be the client scenario?

Actions #6

Updated by Anonymous over 6 years ago

  • Subject changed from [sle][functional][sle15]remote_ssh_controller fails to wait for remote controller connection to [sle][functional][sle15]remote_ssh_controller fails to connect to the client via ssh
Actions #7

Updated by Anonymous over 6 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Anonymous over 6 years ago

  • Category deleted (Bugs in existing tests)

Scenarios of installation via ssh or vnc: remote_ssh_controller, remote_ssh_target_ftp, remote_vnc_controller, remote_vnc_target_nfs all failed because the client didn't boot properly.

Actions #9

Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Feedback

I'll test it with a physical machine next week and update the ticket on whether it is a bug in our tests or a product bug.

Actions #10

Updated by Anonymous over 6 years ago

  • Assignee set to Anonymous
Actions #11

Updated by okurz over 6 years ago

  • Description updated (diff)
  • Category set to Bugs in existing tests

yi wrote:

Olli, you mentioned that you labeled the client scenario with the same progress issue and updated the subject line. Which should be the client scenario?

Sorry, I meant "child scenario".

The corresponding reference jobs from SLE 12 SP3 GM are remote_ssh_controller and remote_ssh_target_ftp, updated description.

Comparing the evaluated test variables from the corresponding scenarios for the latest failed SLE15 vs. the last good SLE 12 SP3 with diff -Naur <(curl -s https://openqa.suse.de/tests/1171122/file/vars.json) <(curl -s https://openqa.suse.de/tests/1058714/file/vars.json) I found the following important differences:

+   "NETBOOT" : "1",
-   "REMOTE_CONTROLLER" : "vnc",
+   "REMOTE_CONTROLLER" : "ssh",

So the controller job for SLE15 seems to be missing the variable NETBOOT, which has an impact on the test flow, and it also does not specify REMOTE_CONTROLLER=ssh.

Previously the variable "NETBOOT" has been specified for the Server-MINI-ISO which does not apply for SLE15 so it needs to be evaluated if we need this variable on the test suites instead.
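
The same comparison can be sketched offline in Python, assuming the two vars.json files have already been downloaded and parsed into dictionaries. The function name diff_vars and the sample values are illustrative only and are not taken from the actual jobs:

```python
def diff_vars(failed, good):
    """Return settings that differ between two parsed vars.json dictionaries.

    The result maps each differing key to a (failed_value, good_value) tuple.
    None marks a key that is missing on that side (this sketch assumes no
    setting is explicitly null).
    """
    changes = {}
    for key in sorted(set(failed) | set(good)):
        if failed.get(key) != good.get(key):
            changes[key] = (failed.get(key), good.get(key))
    return changes


# Illustrative values only; real jobs carry many more settings.
sle15 = {"REMOTE_CONTROLLER": "vnc", "VIDEOMODE": "text"}
sle12sp3 = {"NETBOOT": "1", "REMOTE_CONTROLLER": "ssh", "VIDEOMODE": "text"}

for key, (failed_value, good_value) in diff_vars(sle15, sle12sp3).items():
    print(f"{key}: failed={failed_value!r} good={good_value!r}")
```

Unlike a line-based diff, this ignores key order and whitespace, so only real setting changes show up.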

About REMOTE_CONTROLLER=ssh: I found out now that this was set by you, yi, when triggering the job manually :-) So that was a bit misleading to me. One of the last jobs with REMOTE_CONTROLLER=ssh was https://openqa.suse.de/tests/1167615#step/remote_controller/32 which really looks like it should work; at least it is doing the right step, trying to connect to the other machine with an ssh call. For better debugging I suggest debugging the ssh connection, e.g. calling ssh -vvvv instead of plain ssh in tests/remote/remote_controller.pm, and giving it a bit more time in the following "assert_screen". Maybe we should replace the whole section with:

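        # First attempt: plain ssh call, checked with the default check_screen timeout.
        type_string "ssh root\@$target_ip\n";
        if (!check_screen('remote-ssh-login')) {
            # Retry with verbose output and wait up to 600 s for the login prompt.
            type_string "ssh -vvvv root\@$target_ip\n";
            assert_screen('remote-ssh-login', 600);
        }
        type_string "yes\n";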

And also I suggest to
1) try it out manually first
2) crosscheck if the test still works for SLE 12 SP3.

Actions #12

Updated by okurz over 6 years ago

  • Target version set to Milestone 12
Actions #13

Updated by okurz over 6 years ago

zluo has a running multimachine test environment so please try to clone and run these tests locally

Actions #14

Updated by zluo over 6 years ago

cloned the job: http://e13.suse.de/tests/4205#

the issue is on support site, first_boot failed

Actions #15

Updated by zluo over 6 years ago

support server status
http://e13.suse.de/tests/4204

Actions #16

Updated by mgriessmeier over 6 years ago

looks like target cannot be installed because there was some code-change apparently and it looks for the wrong needles now

https://openqa.suse.de/tests/1180149# compared to SP3 https://openqa.suse.de/tests/1058714#

Actions #17

Updated by Anonymous over 6 years ago

Up to Build260.4 the test remote_ssh_controller could finish boot_to_desktop and reach the remote_controller test module:
http://openqa.suse.de/tests/1173562

However the cloned test of this job also fails at boot_to_desktop:
http://f146.suse.de/tests/1212

Actions #18

Updated by Anonymous over 6 years ago

The needle 'linux-login' was removed from the matching list, thus boot_to_desktop failed. I'll test with the needle added in the list.

Actions #19

Updated by Anonymous over 6 years ago

  • Status changed from Feedback to In Progress
Actions #20

Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Feedback

Installation via FTP doesn't work from my test machine. http://f146.suse.de/tests/1231
Clicking OK doesn't help either; the error message pops up again.

Actions #21

Updated by Anonymous over 6 years ago

I created PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3628
Maybe someone else can continue testing with it.

Actions #22

Updated by okurz over 6 years ago

  • Due date changed from 2017-09-27 to 2017-10-11

should be completable in next sprint

Actions #23

Updated by Anonymous over 6 years ago

With PR adapted and needles adjusted/created, the test can proceed on zluo's test machine:
http://e13.suse.de/tests/4266
http://e13.suse.de/tests/4265

Both have passed.

Actions #25

Updated by Anonymous over 6 years ago

Please merge PR so we can close the ticket. Thanks.

Actions #26

Updated by Anonymous over 6 years ago

  • Status changed from Feedback to In Progress
Actions #27

Updated by okurz over 6 years ago

The needles MR is merged but I don't want to merge the test PR. Maybe just a variable like "TEXTMODE=1" is missing on the scenario or something but it's not the right way to just accept the text login prompt.

Please evaluate if the problem is SLE15 specific or if it also does not work anymore for SLE12.

Actions #28

Updated by okurz over 6 years ago

  • Target version changed from Milestone 12 to Milestone 11

moving to M11 as it's in sprint1 now

Actions #29

Updated by Anonymous over 6 years ago

There is a variable VIDEOMODE=text in the settings for this scenario. I compared the settings with SLE12SP3; there are no suspicious changes.

Actions #30

Updated by Anonymous over 6 years ago

Olli, please take a look at the test run for SLE12SP3: http://openqa.suse.de/tests/1058714#step/boot_to_desktop/2

At this stage, it looks for exactly these three needles: emergency-mode, emergency-shell, linux-login. For SLE15 the test failed because someone removed linux-login and put display-manager in the list instead. What I did in my PR is to restore the previous state. I don't see a reason why it shouldn't be merged.

Actions #31

Updated by okurz over 6 years ago

yi wrote:

The needle 'linux-login' was removed from the matching list, thus boot_to_desktop failed. I'll test with the needle added in the list.

So I think we agreed that the needle was not removed from the matching list but the test code and/or test settings changed.

I triggered tests with explicitly set DESKTOP=textmode

$ openqa_clone_job_osd 1187284 DESKTOP=textmode TEST=okurz_poo#23814_triggered_with_videomode_text
Cloning dependencies of sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_controller@64bit
Created job #1190375: sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_target_ftp@64bit -> https://openqa.suse.de/t1190375
Created job #1190376: sle-15-Leanos-DVD-x86_64-Build278.1-remote_ssh_controller@64bit -> https://openqa.suse.de/t1190376

The test now succeeds in logging in to the text session, so I conclude that we need the variable DESKTOP=textmode on the testsuite. I have now added this to the testsuite remote_ssh_controller.

Now we are back to the old problem in https://openqa.suse.de/tests/1190376#step/remote_controller/32 , the ssh connection does not succeed. So back to my suggestion in #23814#note-11

Actions #32

Updated by Anonymous over 6 years ago

Recent test runs (after removal of a workaround needle):
https://openqa.suse.de/tests/1193261

Actions #33

Updated by zluo over 6 years ago

We tried a couple of times to verify the issue. We fixed the needle inst-textselected locally, which was not matched, and re-ran the test, but without success.

http://e13.suse.de/tests/4323#step/remote_target/1

send_key 'ret' doesn't seem to work here; the support-server fails at stage remote-target.

Actions #34

Updated by zluo over 6 years ago

@okurz

installation_ready is not available to remote_controller even after I added the following changes (your suggestion) in remote_controller.pm:

--
        type_string "ssh root\@$target_ip\n";
        assert_screen "remote-ssh-login";
        if (!check_screen('remote-ssh-login')) {
            type_string "ssh -vvvv root\@$target_ip\n";
            assert_screen('remote-ssh-login', 600);
        }

        type_string "yes\n";

same issue at stage remote-target:

http://e13.suse.de/tests/4327#step/remote_target/1

I used the following command to clone the job:

sudo /usr/share/openqa/script/clone_job.pl --from http://openqa.suse.de 1193332 VNC_TYPING_LIMIT=40 DESKTOP=textmode NETBOOT=1 REMOTE_CONTROLLER=ssh

Actions #35

Updated by zluo over 6 years ago

Checked the needle tag again and added inst-textselected-with_colormenu to inst-textselected, because esc is needed after textmode is selected. With this change send_key 'ret' works.

However it stops at stage installation after the installation got started:

http://e13.suse.de/tests/4331#step/remote_target/5

Actions #36

Updated by zluo over 6 years ago

  • Assignee changed from Anonymous to zluo
Actions #37

Updated by zluo over 6 years ago

Found "//" in the ftp path. It seems that the ftp path is wrong, otherwise the URL page for the installation wouldn't pop up.

Actions #38

Updated by zluo over 6 years ago

Well, the strange thing is that "//" works in the browser, although it is wrong.

The problem on my local machine was that the ftp port was closed. After opening ftp port 21 it looks good:

http://e13.suse.de/tests/4335#step/remote_target/1

Actions #40

Updated by zluo over 6 years ago

  • Status changed from In Progress to Resolved

osd reference test shows that this is working fine:

https://openqa.suse.de/tests/1195279

set as resolved for now.

Actions #41

Updated by okurz over 6 years ago

  • Status changed from Resolved to In Progress

good work so far. The target fails to boot into the expected graphical session though: https://openqa.suse.de/tests/1195278#step/first_boot/6 . Certainly not the same issue as the original one but a followup. I suggest to try to fix that within this sprint within this ticket. If not possible please create another ticket accordingly and link it appropriately.

Actions #42

Updated by zluo over 6 years ago

@okurz
okay, will check this then

Actions #43

Updated by zluo over 6 years ago

https://openqa.suse.de/tests/1196678#step/remote_controller/32

It seems that the login itself has a problem for remote-controller.

possible network issue on osd.

compare with my local server:

http://e13.suse.de/tests/4363#step/remote_controller/31

remote-ssh-login-20160511.json is there...

Actions #44

Updated by zluo over 6 years ago

https://openqa.suse.de/tests/1195278#step/first_boot/6

shows that first_boot progresses to the login prompt.
This is actually already okay. The question is why displaymanager (a GUI needle) is needed.

The issue above is a different issue and needs to be solved first.

Actions #45

Updated by zluo over 6 years ago

compare with http://e13.suse.de/tests/4265#step/first_boot/1 (9 days ago)

It proceeded successfully to first_boot in the GUI.

Actions #46

Updated by zluo over 6 years ago

Changed first_boot.pm for the case when it boots up in textmode (the text-login needle is required):

assert_screen [qw(displaymanager emergency-shell emergency-mode text-login)], $boot_timeout;

Fixed; the result can be seen at: http://e13.suse.de/tests/4370#step/first_boot/1

Actions #48

Updated by okurz over 6 years ago

  • Assignee changed from zluo to okurz

The PR seems wrong to me as it (again) tries to look for "text-login" when we expect a graphical session. Maybe I was wrong in #23814#note-31 to conclude that we need to change DESKTOP=textmode as well. Maybe only VIDEOMODE=text should be changed. I triggered tests with DESKTOP=gnome:

$ openqa_clone_job_osd 1196769 _GROUP=0 DESKTOP=gnome
Cloning dependencies of sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_controller@64bit

-> Created job #1197393: sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_target_ftp@64bit -> https://openqa.suse.de/t1197393
-> Created job #1197394: sle-15-Leanos-DVD-x86_64-Build280.1-remote_ssh_controller@64bit -> https://openqa.suse.de/t1197394

Actions #49

Updated by okurz over 6 years ago

that again failed in the controller assuming the desktop to be gnome on initial bootup. So we need to ensure that "change_desktop" is not called but the desktop of the controller is still textmode, hm.

Actions #50

Updated by okurz over 6 years ago

  • Due date deleted (2017-10-11)

It's a more generic problem when we have two machines that we need to care about. The "controller" has to stay in DESKTOP=textmode, the "target" has to be DESKTOP=gnome. Maybe we can introduce a new test variable, e.g. TARGET_DESKTOP or adjust the test method "default_desktop" for this scenario.
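
A minimal sketch of the TARGET_DESKTOP idea, written in Python for illustration rather than in the test distribution's own code. TARGET_DESKTOP is the hypothetical variable proposed above, and expected_desktop and the role names are likewise made up for this example:

```python
def expected_desktop(settings, role):
    """Pick the desktop a test module should expect for the given machine.

    settings: dict of job variables; role: "controller" or "target".
    TARGET_DESKTOP is a hypothetical variable, consulted only for the target.
    """
    if role == "target":
        # Fall back to DESKTOP when no separate target desktop is configured.
        return settings.get("TARGET_DESKTOP", settings["DESKTOP"])
    if role == "controller":
        return settings["DESKTOP"]
    raise ValueError(f"unknown role: {role!r}")


settings = {"DESKTOP": "textmode", "TARGET_DESKTOP": "gnome"}
print(expected_desktop(settings, "controller"))  # textmode
print(expected_desktop(settings, "target"))      # gnome
```

With the fallback, scenarios that never set the extra variable would keep their current behavior.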

Actions #51

Updated by okurz over 6 years ago

  • Assignee deleted (okurz)

sorry, not working on this one right now.

Actions #52

Updated by Anonymous over 6 years ago

  • Assignee set to Anonymous
Actions #53

Updated by okurz over 6 years ago

  • Due date set to 2017-11-08
Actions #54

Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Feedback

Currently I can't run the test on my machine. Mattias tried to help but couldn't find the reason for it. He also ran the test on his machine and hit the same issue: the worker couldn't access the FTP installation source.

Actions #55

Updated by okurz over 6 years ago

  • Due date changed from 2017-11-08 to 2017-11-22
  • Status changed from Feedback to In Progress
  • Assignee deleted (Anonymous)
  • Target version changed from Milestone 11 to Milestone 12

Should be testable by anyone that has access to a proper multimachine test environment, let's try again in the next sprint.

The availability of multimachine test environments is a common problem for all of us, and one where we should improve.

Actions #56

Updated by okurz over 6 years ago

  • Due date deleted (2017-11-22)
Actions #57

Updated by okurz over 6 years ago

  • Due date set to 2018-01-30
  • Target version changed from Milestone 12 to Milestone 13
Actions #58

Updated by thehejik over 6 years ago

The problem was introduced by adding more multimachine workers and connecting them together by using GRE tunnels.

When we were using only one multimachine worker the test didn't use any VLAN in ovs-switch but it worked because it was within one worker and the same ovs-switch.

Now I've set NETWORKS=fixed for the remote_ssh* and remote_vnc* tests in osd, which should assign the VMs to the same VLAN on ovs-switch.

Actions #59

Updated by thehejik over 6 years ago

  • Status changed from In Progress to Resolved
  • Assignee set to thehejik
  • % Done changed from 0 to 100

We introduced 2 new variables to remote_ssh test: NETWORKS=fixed, DESKTOP=textmode.

There are three other problems with the remote_{ssh,vnc} tests:

  • I will set MTU=1458 for the VMs so that they can pass traffic through the GRE tunnel
  • Martin Loviska is working on a fix for the first_boot issue in remote_ssh; it probably fails due to the removal of the grub2 countdown in SLE15 (already done)
  • Martin Kravec will fix a problem with select_console "x11" in remote_vnc; the problem is probably that the controller image is SLE12SP1 but the test is set to VERSION=15, so console switching doesn't work properly: it sends alt-f2 instead of alt-f7 for SLE12.
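
The value 1458 is consistent with the usual overhead of an Ethernet-over-GRE tunnel on a 1500-byte link. The breakdown below is an assumption (outer IPv4 header without options, GRE header with a 4-byte key field, inner Ethernet header) and is not stated in this ticket:

```python
# Assumed per-packet overhead of an Ethernet-over-GRE tunnel, in bytes.
PHYSICAL_MTU = 1500      # assumed MTU of the link between the worker hosts
OUTER_IPV4_HEADER = 20   # IPv4 header without options
GRE_HEADER = 4           # base GRE header
GRE_KEY = 4              # optional key field, assumed to be in use
INNER_ETHERNET = 14      # Ethernet header of the encapsulated frame


def tunnel_mtu(physical_mtu=PHYSICAL_MTU):
    """Largest inner IP packet that fits into one tunnelled frame."""
    overhead = OUTER_IPV4_HEADER + GRE_HEADER + GRE_KEY + INNER_ETHERNET
    return physical_mtu - overhead


print(tunnel_mtu())  # 1458
```

If the physical network uses a different MTU or the tunnel omits the key field, the result changes accordingly.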
Actions #60

Updated by thehejik over 6 years ago

  • Status changed from Resolved to In Progress
  • % Done changed from 100 to 50
Actions #61

Updated by thehejik over 6 years ago

  • Assignee changed from thehejik to mloviska

@mloviska

  • use MTU=1458 for controller and for target as well.
    • Ideally switch to regular support server for controller and use dhcp,dns services - replace static IP configuration with DHCP
  • https://progress.opensuse.org/issues/30228
Actions #62

Updated by okurz about 6 years ago

  • Due date changed from 2018-01-30 to 2018-02-13
  • Target version changed from Milestone 13 to Milestone 14
Actions #63

Updated by okurz about 6 years ago

  • Subject changed from [sle][functional][sle15]remote_ssh_controller fails to connect to the client via ssh to [sle][functional][hard][sle15]remote_ssh_controller fails to connect to the client via ssh
Actions #64

Updated by okurz about 6 years ago

In the meantime we had SLE15 ssh tests that worked, but now we have an incomplete job: https://openqa.suse.de/tests/1440078

Stuck in

[2018-02-01T00:01:46.0202 CET] [debug] ||| starting remote_controller tests/remote/remote_controller.pm
[2018-02-01T00:01:46.0228 CET] [debug] mutex lock 'installation_ready'
[2018-02-01T00:01:46.0258 CET] [debug] mutex lock 'installation_ready' unavailable, sleeping 5s

I recommend you check this again :) Could be a race condition?

Actions #65

Updated by mkravec about 6 years ago

Parent job timed out after 2 hours waiting for child.
Child job says: "State: cancelled finished about 9 hours ago ( 0 )" - seems it never started because there was no free worker (I'm guessing here)

Both were restarted and passed fine. Snippet from autoinst-log.txt is normal behavior of locked mutex.
I think this issue is resolved, wdyt?
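
The log lines quoted in note 64 match a simple polling loop: request the lock and, if the other job has not made it available yet, sleep and retry. A minimal Python sketch of that pattern follows. The try_lock callable is a hypothetical stand-in for the real lock request, and sleep is injectable so the loop can be exercised without waiting:

```python
import time


def wait_for_mutex(name, try_lock, interval=5, sleep=time.sleep, max_attempts=None):
    """Poll try_lock(name) until it returns True; return the attempt count.

    try_lock is a hypothetical stand-in for the real lock request.
    Raises TimeoutError if max_attempts is given and exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_lock(name):
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(f"mutex {name!r} still unavailable")
        print(f"mutex lock {name!r} unavailable, sleeping {interval}s")
        sleep(interval)


# Simulated peer job that makes the lock available on the third request.
responses = iter([False, False, True])
attempts = wait_for_mutex("installation_ready", lambda _name: next(responses),
                          sleep=lambda _seconds: None)
print(attempts)  # 3
```

Without an upper bound such a loop keeps waiting until the job itself times out, which fits the parent job timing out after two hours when its child never started.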

Actions #66

Updated by mloviska about 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100