action #23650

[sle][functional][ipmi][epic][u] Fix test suite gnome to work on ipmi 12-SP3 and 15 (WAS: test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?)

Added by okurz over 2 years ago. Updated about 1 month ago.

Status: Blocked
Start date: 20/10/2017
Priority: High
Due date:
Assignee: zluo
% Done: 86%
Category: Bugs in existing tests
Target version: SUSE QA tests - Milestone 30
Difficulty:
Duration:

Description

Observation

openQA test in scenario sle-15-Leanos-DVD-x86_64-gnome@64bit-ipmi fails in
boot_from_pxe as incomplete with the message "Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused" in the autoinst-log.txt.

Expected result

The test never worked for SLE15.

Acceptance criteria

  • AC1: Test suite default is able to complete as install only for SLE12-SP5+ and SLE15-SP1+ over IPMI
  • AC2: Test suite gnome is able to complete as install only for SLE12-SP5+ and SLE15-SP1+ over IPMI

Problem

autoinst-log.txt

10:08:07.4045 Debug: /var/lib/openqa/cache/tests/sle/tests/boot/boot_from_pxe.pm:100 called testapi::select_console
10:08:07.4046 5804 <<< testapi::select_console(testapi_console='installation')
/usr/lib/os-autoinst/consoles/vnc_base.pm:64:{
  'password' => 'nots3cr3t',
  'port' => 5901,
  'hostname' => '10.162.2.87'
}
10:08:09.4100 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:10.4115 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:11.4129 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:12.4139 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:13.4150 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:14.4162 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:15.4174 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
10:08:16.4186 5808 Error connecting to host <10.162.2.87>: IO::Socket::INET: connect: Connection refused
DIE socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 881.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('socket does not exist. Probably your backend instance could n...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 801
    consoles::VNC::catch {...} ('socket does not exist. Probably your backend instance could n...') called at /usr/lib/perl5/vendor_perl/5.18.2/Try/Tiny.pm line 115
    Try::Tiny::try('CODE(0x843f028)', 'Try::Tiny::Catch=REF(0x843f310)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 803
    consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x8440268)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 74

That's annoying because incomplete jobs are harder to understand and carry-over cannot work. (Improving this message is handled separately.)

Further details

Always latest result in this scenario:
- latest SLE15-SP2 default (server role) (not yet available)
- latest SLE12-SP5 default
- latest SLE12-SP5 gnome
- latest SLE15-SP1 default (server role)
- latest SLE15-SP1 gnome
- latest SLE15
- former latest, Leanos-DVD


Subtasks

action #26926: [sle][functional]ipmi VNC reconnect failures cause jobs t... Resolved szarate

action #26928: [sle][functional]gnome@64bit-ipmi using VNC installation ... Resolved okurz

action #26948: [sle][functional][ipmi][hard] Adjust boot_from_pxe to san... Resolved SLindoMansilla

action #32089: [sle][functional][u][ipmi][easy] test fails in first_boot... Resolved SLindoMansilla

action #37387: [sle][functional][ipmi][u] Fix test suite gnome to work o... Rejected okurz

action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - ... Workable xlai

action #41693: [sle][functional][u][ipmi][sporadic] test fails in boot_f... Rejected SLindoMansilla


Related issues

Related to openQA Tests - action #20022: [sle][functional][zkvm][s390] incomplete test due to sock... Resolved 23/06/2017 25/10/2017
Blocked by openQA Tests - action #19350: [sle][functional][s390x][zkvm][hard] make unavailable ssh... Resolved 24/05/2017 17/01/2018
Blocks openQA Tests - action #41207: [functional][u][ipmi] test fails in reboot_gnome - seems ... Blocked 18/09/2018
Blocked by openQA Tests - action #53249: [epic][functional][u] ensure that grub_test gets a bootin... Workable 04/11/2019

History

#2 Updated by okurz over 2 years ago

  • Blocked by action #19350: [sle][functional][s390x][zkvm][hard] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606) added

#3 Updated by okurz over 2 years ago

  • Assignee set to nicksinger

as discussed during standup 2017-09-20

#4 Updated by okurz over 2 years ago

  • Target version set to Milestone 11

#6 Updated by okurz over 2 years ago

  • Due date set to 11/10/2017

#7 Updated by okurz over 2 years ago

The latest job does not end as incomplete but fails without stating a failed module. To my understanding the VNC stall detection is just a symptom, not the problem. It could be that the VNC process or an ssh terminal process died in the background and isotovideo does not check for that, so I suggest one of the following:

  1. (preferred) the VNC process terminating should not go unnoticed
  2. catch the "die" and handle it gracefully after the connection is terminated
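Both suggestions can be sketched together. The following is only a minimal Python illustration of the idea, not os-autoinst code (which is written in Perl). The names HelperProcessDied, ensure_alive and run_step are hypothetical and exist only for this sketch.

```python
class HelperProcessDied(RuntimeError):
    """Raised when a supervised helper process has exited unexpectedly."""


def ensure_alive(name, process):
    """Raise HelperProcessDied if the process has terminated.

    process can be any object with a poll() method that returns None while
    running and the exit code afterwards, e.g. a subprocess.Popen instance.
    """
    returncode = process.poll()
    if returncode is not None:
        raise HelperProcessDied(f"{name} exited with code {returncode}")


def run_step(step, helpers):
    """Check all helpers, then run step(); report a helper death as a result.

    Returns ("ok", value) or ("failed", reason) instead of letting the
    exception abort the whole run.
    """
    try:
        for name, process in helpers.items():
            ensure_alive(name, process)
        return ("ok", step())
    except HelperProcessDied as error:
        return ("failed", str(error))
```

With a check like this before each step, a dead VNC or ssh helper would show up as a failed step with a readable reason rather than as a stall that is detected much later.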

#8 Updated by mgriessmeier over 2 years ago

  • Status changed from New to In Progress

Work-in-progress PR created. Unfortunately it has not progressed as far as we wanted because more important issues popped up.
@okurz, nsinger: hopefully you can take this as a base to continue in this sprint.

We turned the die into an exception but failed to add a record_info box. However, we found a nice way to reproduce the "Socket does not exist" issue consistently (nsinger knows more about that).

https://github.com/os-autoinst/os-autoinst/pull/862

#9 Updated by mgriessmeier over 2 years ago

  • Related to action #20022: [sle][functional][zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died added

#10 Updated by riafarov over 2 years ago

PR with fix of review comment to be able to merge it: https://github.com/os-autoinst/os-autoinst/pull/864

#11 Updated by okurz over 2 years ago

Did not complete in sprint 1. Main reason: a spontaneous packaging training which we were not aware of beforehand. We have the PR, which should improve user feedback a lot, and finishing this is definitely possible in the next sprint 2.

#12 Updated by okurz over 2 years ago

  • Due date changed from 11/10/2017 to 25/10/2017

#14 Updated by nicksinger over 2 years ago

@okurz helped a lot in forming a hypothesis with me about what happens here: our current IPMI code in the test (boot/boot_from_pxe.pm) checks for a running SSHD on the SUT and then continues. The next step activates a console named "installation", which for the worker basically means: "open a connection (whatever protocol) to the SUT and expect the yast installer there". Right now the test implicitly assumes a running VNC server once it has seen a running SSHD. This worked for a long time but obviously does not hold anymore, so the test tries to connect too early and receives a "Connection refused".
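To illustrate the hypothesis: instead of inferring VNC availability from SSHD, the test could probe the port it actually needs before selecting the console. This is a minimal Python sketch of such a probe, not the Perl code in boot_from_pxe.pm. The helper names port_is_open and wait_until are hypothetical.

```python
import socket
import time


def port_is_open(host, port, connect_timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        # Covers "Connection refused" as well as timeouts and unreachable hosts.
        return False


def wait_until(probe, timeout=60.0, interval=1.0):
    """Call probe() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the address and port from the autoinst-log.txt excerpt in the description, a call could look like wait_until(lambda: port_is_open("10.162.2.87", 5901), timeout=300). The timeout value here is an arbitrary placeholder.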

The current approach now is to adjust the needle which checks for a running SSH/VNC dynamically based on the variable "VIDEOMODE": http://openqa.glados.qa.suse.de/tests/484#step/boot_from_pxe/24
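The selection by VIDEOMODE could look roughly like the sketch below. It is written in Python for illustration only. The function name service_to_wait_for is hypothetical, and the mapping of VIDEOMODE values to services is an assumption that this ticket does not confirm.

```python
def service_to_wait_for(settings):
    """Pick the service the installer is expected to offer on the SUT.

    Assumption for this sketch: VIDEOMODE "text" and "ssh-x" mean an
    SSH-based installation, anything else (including unset) means VNC.
    """
    videomode = settings.get("VIDEOMODE", "")
    if videomode in ("text", "ssh-x"):
        return {"name": "ssh", "port": 22}
    # 5901 is the VNC port shown in the autoinst-log.txt excerpt of this ticket.
    return {"name": "vnc", "port": 5901}
```

The test would then wait for the needle (or port) that matches the returned service instead of always checking for SSHD.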

#15 Updated by okurz over 2 years ago

The screenshot does not look like there is a responsive VNC server. I suggest cross-checking manually.

#16 Updated by nicksinger over 2 years ago

  • Description updated (diff)

#17 Updated by nicksinger over 2 years ago

Manual investigation by @okurz and me revealed that the kernel parameters for console redirection are part of the reason why this fails. Removing the additional "console=tty" and leaving just "console=ttyS1,115200" in there results in the expected output on the serial console.
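As an illustration of that change, here is a small Python sketch that strips virtual-terminal console parameters from a kernel command line and keeps the serial one. The function name drop_virtual_consoles is hypothetical, and the sample command line in the usage note is made up.

```python
import re

# Matches console=tty, console=tty0, console=tty1, ... but not console=ttyS1.
VIRTUAL_CONSOLE = re.compile(r"console=tty\d*")


def drop_virtual_consoles(cmdline):
    """Remove console=tty* parameters and keep serial ones such as ttyS1."""
    kept = [arg for arg in cmdline.split() if not VIRTUAL_CONSOLE.fullmatch(arg)]
    return " ".join(kept)
```

For example, drop_virtual_consoles("splash=silent console=tty console=ttyS1,115200") returns "splash=silent console=ttyS1,115200".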

#18 Updated by okurz over 2 years ago

as discussed, please make sure the ticket is closed today

#19 Updated by okurz over 2 years ago

  • Due date set to 25/10/2017

due to changes in a related task

#20 Updated by okurz over 2 years ago

  • Subject changed from [sle][functional][ipmi]test fails in boot_from_pxe - connection refused trying to ipmi host over ssh? to [sle][functional][ipmi][epic]test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?
  • Status changed from In Progress to Feedback

I think the comment you wanted to add is "https://github.com/nicksinger/os-autoinst-distri-opensuse/commit/c7d775f042e346d00cd084bc9bf7b4df30ba7768 provides a first fix for this. Test can now continue and finds the right needle: http://openqa.glados.qa.suse.de/tests/517 . The test can still not succeed since the VNC server binds to the second interface and is therefore not reachable on the expected address." from #26038#note-14

So we failed to close this ticket today … but at least I tried to improve by creating subtickets

#21 Updated by nicksinger over 2 years ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3770 addresses the major problem we want to address here (test checks for SSH running while we really want to connect to VNC). Anyway, it will still fail and there is more improvement needed which I'll track in another ticket.

#22 Updated by okurz over 2 years ago

  • Status changed from Resolved to In Progress

#23 Updated by nicksinger over 2 years ago

  • Status changed from In Progress to Feedback

#24 Updated by okurz over 2 years ago

  • Due date changed from 25/10/2017 to 08/11/2017

due to changes in a related task

#25 Updated by okurz over 2 years ago

  • Target version changed from Milestone 11 to Milestone 12

#26 Updated by okurz about 2 years ago

  • Due date changed from 08/11/2017 to 16/01/2018

due to changes in a related task

#27 Updated by nicksinger about 2 years ago

  • Assignee deleted (nicksinger)

#28 Updated by okurz about 2 years ago

  • Due date changed from 16/01/2018 to 27/02/2018

due to changes in a related task

#29 Updated by okurz about 2 years ago

  • Target version changed from Milestone 12 to Milestone 14

#30 Updated by okurz almost 2 years ago

  • Assignee set to SLindoMansilla

#31 Updated by SLindoMansilla almost 2 years ago

  • Status changed from Feedback to In Progress

Working on sub-task

#32 Updated by riafarov almost 2 years ago

  • Due date changed from 27/02/2018 to 13/03/2018

due to changes in a related task

#33 Updated by SLindoMansilla almost 2 years ago

  • Status changed from In Progress to Resolved

All subtasks resolved, all related tasks resolved.

#34 Updated by okurz almost 2 years ago

  • Status changed from Resolved to Workable

But expected result is not met: See https://openqa.suse.de/tests/1514178 failing in first_boot

#35 Updated by SLindoMansilla almost 2 years ago

Trying with DESKTOP=textmode: http://copland.arch.suse.de/tests/942

#37 Updated by mgriessmeier almost 2 years ago

  • Due date changed from 13/03/2018 to 27/03/2018

due to changes in a related task

#38 Updated by mgriessmeier almost 2 years ago

  • Target version changed from Milestone 14 to Milestone 15

#39 Updated by mgriessmeier almost 2 years ago

  • Due date changed from 27/03/2018 to 10/04/2018

due to changes in a related task

#40 Updated by okurz almost 2 years ago

  • Subject changed from [sle][functional][ipmi][epic]test fails in boot_from_pxe - connection refused trying to ipmi host over ssh? to [sle][functional][ipmi][epic][u]test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?

#41 Updated by mgriessmeier almost 2 years ago

  • Due date changed from 10/04/2018 to 24/04/2018

due to changes in a related task

#42 Updated by SLindoMansilla almost 2 years ago

  • Subject changed from [sle][functional][ipmi][epic][u]test fails in boot_from_pxe - connection refused trying to ipmi host over ssh? to [sle][functional][ipmi][epic][u] Fix test suite gnome to work on ipmi 12-SP3 and 15 (WAS: test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?)

#43 Updated by SLindoMansilla almost 2 years ago

  • Related to action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot - VNC installation on SLE 15 failed because of various issues (ipmi worker, first_boot, boot_from_pxe, await_install) added

#44 Updated by mgriessmeier almost 2 years ago

  • Due date changed from 24/04/2018 to 08/05/2018

due to changes in a related task

#45 Updated by okurz almost 2 years ago

  • Target version changed from Milestone 15 to Milestone 16

correcting milestone

#46 Updated by okurz over 1 year ago

  • Related to action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all added

#47 Updated by okurz over 1 year ago

  • Description updated (diff)
  • Target version changed from Milestone 16 to Milestone 17

https://openqa.suse.de/tests/1749830 is the latest job, not exactly "working" so we are not done here.

#48 Updated by okurz over 1 year ago

  • Target version changed from Milestone 17 to Milestone 21+

#49 Updated by okurz over 1 year ago

  • Target version changed from Milestone 21+ to Milestone 21+

#50 Updated by okurz over 1 year ago

  • Status changed from Workable to Blocked

#51 Updated by SLindoMansilla over 1 year ago

  • Related to deleted (action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all)

#52 Updated by SLindoMansilla over 1 year ago

  • Blocked by action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all added

#53 Updated by SLindoMansilla over 1 year ago

  • Blocked by deleted (action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all)

#54 Updated by SLindoMansilla over 1 year ago

Sorry, I confused the subject line with the PXE. Added as subtask: #36027

#55 Updated by SLindoMansilla over 1 year ago

  • Related to deleted (action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot - VNC installation on SLE 15 failed because of various issues (ipmi worker, first_boot, boot_from_pxe, await_install))

#56 Updated by SLindoMansilla over 1 year ago

Maybe related, this job restarts IPMI machines: http://jenkins.qa.suse.de/job/restart-ipmi-mainboard/

#57 Updated by okurz about 1 year ago

  • Blocks action #41207: [functional][u][ipmi] test fails in reboot_gnome - seems we call some code which we are not allowed to do, need to "reset_consoles" or something? nearly there to a complete run again :) added

#58 Updated by okurz about 1 year ago

  • Target version changed from Milestone 21+ to Milestone 24

#59 Updated by mgriessmeier 9 months ago

  • Target version changed from Milestone 24 to Milestone 25

Idk what's the state here - can someone explain?

#60 Updated by SLindoMansilla 9 months ago

  • Description updated (diff)
  • Status changed from Blocked to Workable

Blocker resolved: #19350

#61 Updated by SLindoMansilla 9 months ago

  • Description updated (diff)
  • Assignee deleted (SLindoMansilla)

#62 Updated by SLindoMansilla 9 months ago

  • Description updated (diff)

#63 Updated by zluo 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to zluo

Taking over. This is a 2-year-old ticket!

#64 Updated by zluo 9 months ago

https://openqa.suse.de/tests/2931505 shows boot_from_pxe works fine.

re-trigger it on osd because hostname_inst got wrongly scheduled for ipmi:
https://openqa.suse.de/tests/2956663#settings

#65 Updated by zluo 9 months ago

  • Status changed from In Progress to Rejected

grub_test failed, but this is another issue. So I don't see any problem for gnome test on ipmi.

set as rejected for now

#66 Updated by okurz 9 months ago

  • Status changed from Rejected to In Progress

It is true that "boot_from_pxe" is now more stable. However the ACs are not fulfilled, please see the description for that. It mentions four scenarios. Currently https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&groupid=129&groupid=110&groupid=132&build=228.2&arch=x86_64 shows only btrfs@ipmi so default@ipmi, gnome@ipmi and the corresponding two for SLE15 are missing.

#67 Updated by zluo 8 months ago

  • Target version changed from Milestone 25 to Milestone 26

Sergio has added gnome@64bit-ipmi now, need to check the test results for next build.

#68 Updated by SLindoMansilla 8 months ago

We still have not had any build since I scheduled the gnome@ipmi job.
If necessary we could perform a job POST, typing the settings manually.
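A manual job POST could be prepared along these lines. This is a sketch that only builds the command line for the openqa-cli tool and does not send anything. The function name build_job_post_command is hypothetical, and the settings are placeholders derived from the scenario name and the build mentioned in comment #66. A real job would need more settings (for example FLAVOR and the installation medium).

```python
import shlex


def build_job_post_command(host, settings):
    """Return the argv list for posting one job with the given settings."""
    argv = ["openqa-cli", "api", "--host", host, "-X", "POST", "jobs"]
    # openQA job settings are passed as KEY=VALUE arguments.
    argv += [f"{key}={value}" for key, value in sorted(settings.items())]
    return argv


# Placeholder values, see the note above; not a complete set of job settings.
settings = {
    "DISTRI": "sle",
    "VERSION": "15-SP1",
    "ARCH": "x86_64",
    "TEST": "gnome",
    "MACHINE": "64bit-ipmi",
    "BUILD": "228.2",
}
command = build_job_post_command("https://openqa.suse.de", settings)
print(shlex.join(command))
```

Printing the command first makes it easy to review the settings before actually running it with valid API credentials.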

#70 Updated by SLindoMansilla 8 months ago

  • Description updated (diff)

#71 Updated by zluo 8 months ago

@sergio this is working for boot_from_pxe, can you put the job then into Job groups "SLES 12 functional"? Thanks!

#72 Updated by okurz 7 months ago

the test still fails in first_boot. Nothing changed because no one changed code: https://openqa.suse.de/tests/3038117#step/first_boot/7 so as long as this doesn't work you should not bring it into the validation job group.

#73 Updated by zluo 7 months ago

I thought boot_from_pxe was not working. It works now; first_boot failed, which I think is another issue.

#74 Updated by zluo 7 months ago

to check https://openqa.suse.de/tests/3043347 (without reconnect_mgmt_console, grub_test)

to check https://openqa.suse.de/tests/3043448 (without grub_test) as well

#76 Updated by zluo 7 months ago

  • Status changed from In Progress to Blocked

https://openqa.suse.de/tests/3043347 shows that first_boot works fine if grub_test is not started before.

So we need to handle the issue reported: #53249
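The verification runs referenced here were scheduled without certain modules. As a sketch of that idea in Python: the function name filter_schedule is hypothetical, and the module order in the usage note is an assumption for illustration, not the actual schedule.

```python
def filter_schedule(modules, excluded):
    """Return modules in their original order, without the excluded ones."""
    skip = set(excluded)
    return [module for module in modules if module not in skip]
```

For example, filter_schedule(["boot_from_pxe", "reconnect_mgmt_console", "grub_test", "first_boot"], ["reconnect_mgmt_console", "grub_test"]) leaves ["boot_from_pxe", "first_boot"], which corresponds to the run where first_boot passed.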

#77 Updated by zluo 7 months ago

  • Blocked by action #53249: [epic][functional][u] ensure that grub_test gets a booting system added

#78 Updated by zluo 7 months ago

  • Target version changed from Milestone 26 to Milestone 30+

#79 Updated by zluo 7 months ago

  • Status changed from Blocked to Workable

#80 Updated by zluo 7 months ago

  • Status changed from Workable to Blocked

#81 Updated by SLindoMansilla 7 months ago

  • Due date changed from 31/12/2018 to 27/09/2018

due to changes in a related task

#82 Updated by SLindoMansilla 7 months ago

  • Due date changed from 08/05/2018 to 27/09/2018

due to changes in a related task

#83 Updated by SLindoMansilla 7 months ago

  • Due date changed from 13/03/2018 to 27/09/2018

due to changes in a related task

#84 Updated by SLindoMansilla 7 months ago

  • Due date changed from 08/11/2017 to 27/09/2018

due to changes in a related task

#85 Updated by SLindoMansilla 7 months ago

  • Due date changed from 08/11/2017 to 27/09/2018

due to changes in a related task

#86 Updated by okurz 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3247183

#87 Updated by okurz 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3329449

#88 Updated by okurz 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3381530

#89 Updated by okurz 3 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3598216

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed

#90 Updated by okurz 3 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3649158


#91 Updated by okurz 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3700456


#92 Updated by okurz about 1 month ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3727073


#93 Updated by mgriessmeier about 1 month ago

  • Target version changed from Milestone 30+ to Milestone 30

needs to be discussed offline

#94 Updated by okurz about 1 month ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: btrfs_libstorage-ng@64bit-ipmi
https://openqa.suse.de/tests/3770256

