action #41480

[sle][functional][u][ipmi] Malfunction of openqaworker2:25 - Investigate, bring it back or repair it (WAS: remove openqaworker2:25 (IPMI machine) from OSD testing)

Added by zluo over 1 year ago. Updated 6 months ago.

Status: Resolved
Start date: 24/09/2018
Priority: High
Due date:
Assignee: SLindoMansilla
% Done: 0%
Category: -
Target version: SUSE QA tests - Milestone 27
Duration:

Description

Observation

Please see https://progress.opensuse.org/issues/31375 (comments #57 and #61) for details.
openqaworker2:25 causes trouble every time, and first_boot fails when it runs on osd.

Investigation

Hypotheses

  • H1 Issues caused by IPMI SUT machine (sp.fozzie.qa.suse.de [IPMI interface], fozzie-1.qa.suse.de [SUT]) - Rejected by E1-1
  • H2 Issues caused by IPMI WORKER machine (openqaworker2, jump host) - Rejected by E2-1
  • H3 Issues caused by openQA test module first_boot
  • H4 Issues caused by the openQA's IPMI backend.

Experiments

  • E1-1 Install SLE15-SP1 manually on sp.fozzie.qa.suse.de / fozzie-1.qa.suse.de mimicking openQA (ipmitool).
  • R1-1 Possible to perform the installation. Some impediments were confirmed when typing linuxrc parameters in the PXE boot menu while interacting through SOL.
  • E2-1 Install SLE15-SP1 manually from openqaworker2 into sp.fozzie.qa.suse.de / fozzie-1.qa.suse.de. (in progress)
  • R2-1 Possible to perform the installation. Some impediments were confirmed when typing linuxrc parameters in the PXE boot menu while interacting through SOL.
  • E3-1 Perform 10 local openQA job runs for scenario BTRFS to get statistics.
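
The manual installation in E1-1 and E2-1 relies on a handful of ipmitool invocations. The following is a minimal sketch of how they could be assembled, not a record of the commands actually used. It assumes the lanplus interface and placeholder credentials (neither is stated in this ticket), the helper names ipmitool_cmd and pxe_install_sequence are hypothetical, and nothing is executed.

```python
import shlex


def ipmitool_cmd(host, user, password, *args):
    """Build the argv for one ipmitool call against a BMC (nothing is run).

    Assumption: the BMC is reached over the lanplus interface.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *args]


def pxe_install_sequence(host, user, password):
    """Commands to request a PXE boot, reset the machine and attach to SOL."""
    return [
        # Ask the firmware to boot from the network on the next boot.
        ipmitool_cmd(host, user, password, "chassis", "bootdev", "pxe"),
        # Reset the machine (assumes it is currently powered on).
        ipmitool_cmd(host, user, password, "chassis", "power", "reset"),
        # Attach to the serial-over-LAN console to drive the PXE menu.
        ipmitool_cmd(host, user, password, "sol", "activate"),
    ]


if __name__ == "__main__":
    # USER and PASSWORD are placeholders, not real credentials.
    for argv in pxe_install_sequence("sp.fozzie.qa.suse.de", "USER", "PASSWORD"):
        print(shlex.join(argv))
```

Building the argument lists separately from running them makes it possible to review the exact sequence before touching the machine.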

Suggestions

  • Conduct a proper statistical analysis on this specific machine and find out which component fails most often. The special worker class 64bit-ipmi_disabled_investigate_poo41480 can be used for this, together with our approach for statistical investigation.
  • Identify how that worker differs from others and check whether the worker/machine/backend/test is special.
  • Come up with a fix in code or settings, or decide in what respect the hardware is broken/unusable and must be decommissioned or repaired (beware: costly decision!)
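
The statistical analysis suggested above could start from a simple tally of which test module fails in each run. This is only a sketch: the ticket does not describe the team's actual approach, so the Wilson score interval is swapped in here as one common way to put bounds on a failure rate from few runs. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def failure_stats(results):
    """Summarise job runs.

    `results` holds one entry per job: the name of the failed test module,
    or None if the job passed.
    """
    total = len(results)
    failed = [r for r in results if r is not None]
    return {
        "total": total,
        "failure_rate": len(failed) / total if total else 0.0,
        "by_module": Counter(failed).most_common(),
    }


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a failure proportion (z=1.96 is about 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = failures / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


if __name__ == "__main__":
    # Made-up results for illustration only, not measurements from fozzie.
    runs = ["first_boot", None, "boot_from_pxe", "first_boot", None]
    stats = failure_stats(runs)
    failures = sum(count for _, count in stats["by_module"])
    print(stats)
    print(wilson_interval(failures, stats["total"]))
```

With only around 10 runs, as planned in E3-1, the interval stays wide, which is a reminder that a handful of failures is weak evidence for or against a specific component.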

Further information

Known pitfalls

  • When pressing ESC during the PXE boot menu countdown, a boot prompt appears where you can type linuxrc parameters. However, you can neither delete (->) nor backspace (<-).
  • The PXE boot menu used by fozzie-1.qa.suse.de [SUT] has 3 levels:
    1. OS version menu (12, 15, 15-SP1...)
    2. Installation media source (NFS, FTP, HTTP...)
    3. Installation mode (SSH, VNC...)
  • When selecting a final entry and pressing TAB, the boot line appears and can be edited. However, backspace (<-) does not work, and typing or deleting characters at the right margin produces broken characters. CTRL-A (beginning of line) and CTRL-E (end of line) work. The cursor keys can be used to move, but pressing them too many times in quick succession or holding them down also produces broken characters.
  • The serial device needed to see the installer output is /dev/ttyS1.
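
Because the boot prompt does not allow corrections and long lines get garbled at the right margin, it may help to compose the parameter line beforehand and check its length. The sketch below is illustrative only: the helper names are hypothetical, the 115200 baud rate and the 80-column width are assumptions not confirmed in this ticket, and the install URL is a placeholder.

```python
def build_boot_params(params, serial_device="ttyS1", baud=115200):
    """Join linuxrc/kernel parameters and append a serial console entry.

    `params` is a list of (name, value) pairs; a value of None yields a
    bare flag. The baud rate is an assumption.
    """
    parts = [f"{name}={value}" if value is not None else name
             for name, value in params]
    parts.append(f"console={serial_device},{baud}")
    return " ".join(parts)


def overflow(line, columns=80):
    """Number of characters that do not fit into `columns` (0 if it fits)."""
    return max(0, len(line) - columns)


if __name__ == "__main__":
    # The URL is a placeholder, not the real installation source.
    line = build_boot_params([
        ("install", "http://example.invalid/SLE-15-SP1"),
        ("ssh", "1"),
    ])
    print(line)
    extra = overflow(line)
    if extra:
        print(f"warning: {extra} characters beyond the assumed 80 columns")
```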

Related issues

Related to openQA Tests - action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - ... Workable 20/10/2017
Blocks openQA Tests - action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot... Rejected 05/02/2018

History

#1 Updated by okurz over 1 year ago

  • Subject changed from [sle][functional] remove openqaworker2:25 (ipmi machine) from osd testing to [sle][functional][u][ipmi] remove openqaworker2:25 (ipmi machine) from osd testing
  • Due date set to 09/10/2018
  • Category set to Infrastructure
  • Priority changed from Normal to High
  • Target version set to Milestone 20

@zluo The links in the ticket description look weird. Did you want to reference comments within that ticket? That would be #31375#note-57 and #31375#note-61 then.

if you like to try you can create a merge request on https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls yourself. Take https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/118 as an example

The machine in question is fozzie

#2 Updated by okurz over 1 year ago

  • Description updated (diff)

#3 Updated by okurz over 1 year ago

  • Blocks action #31375: [sle][functional][ipmi][u][hard] test fails in first_boot - VNC installation on SLE 15 failed because of various issues (ipmi worker, first_boot, boot_from_pxe, await_install) added

#4 Updated by zluo over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to zluo

#5 Updated by okurz over 1 year ago

Merged. Please make sure to keep this ticket or any other ticket open until we have the worker back in production (or decommissioned), so that the worker is not forgotten and left dangling.

#6 Updated by zluo over 1 year ago

  • Status changed from In Progress to Resolved

openqaworker2:25 disabled on osd.

#7 Updated by okurz over 1 year ago

  • Status changed from Resolved to In Progress

you might have overlooked my comment in #41480#note-5

#8 Updated by zluo over 1 year ago

  • Assignee deleted (zluo)

We now have much better test results from IPMI on osd. Please decide on the next step: shut down this IPMI machine or keep investigating it.

#9 Updated by okurz over 1 year ago

  • Status changed from In Progress to Workable

As discussed in person we need to conduct a proper statistical analysis on this specific machine and see how it fares.

#10 Updated by okurz over 1 year ago

  • Subject changed from [sle][functional][u][ipmi] remove openqaworker2:25 (ipmi machine) from osd testing to [sle][functional][u][ipmi] remove openqaworker2:25 (ipmi machine) from osd testing, investigate, bring it back or decomission

#11 Updated by SLindoMansilla over 1 year ago

  • Subject changed from [sle][functional][u][ipmi] remove openqaworker2:25 (ipmi machine) from osd testing, investigate, bring it back or decomission to [sle][functional][u][ipmi] Malfunction of openqaworker2:25 - Investigate, bring it back or decommission it (WAS: remove openqaworker2:25 (IPMI machine) from OSD testing)
  • Assignee set to SLindoMansilla

#12 Updated by okurz over 1 year ago

  • Description updated (diff)

#13 Updated by SLindoMansilla over 1 year ago

  • Description updated (diff)
  • Status changed from Workable to In Progress

#14 Updated by SLindoMansilla over 1 year ago

Trying to install SLE 15 SP1:

Having problems typing linuxrc parameters.
When the line is bigger than the "screen", it gets cut off and I am not able to see what I am typing.

This problem was already observed on openQA.

#15 Updated by okurz over 1 year ago

  • Due date changed from 09/10/2018 to 23/10/2018

#16 Updated by SLindoMansilla over 1 year ago

  • Description updated (diff)

After using the last installation media from openqa.suse.de, I was able to install SLE15-SP1. The only unreliable part is the interaction through ipmitool ... sol activate.
I assume I could improve this interaction with a better understanding of the relationship between SOL, serial consoles and serial devices on IPMI. (rpalethorpe could help)

#17 Updated by SLindoMansilla over 1 year ago

  • Description updated (diff)

#18 Updated by SLindoMansilla over 1 year ago

  • Description updated (diff)

#19 Updated by SLindoMansilla over 1 year ago

  • Description updated (diff)

#20 Updated by SLindoMansilla over 1 year ago

So far I have not been able to get a run. I only get incomplete jobs with the following error message:

[2018-10-11T18:35:17.0771 CEST] [debug] <<< testapi::wait_serial(regexp='SysRq : Show Blocked State', timeout=1)
DIE 'current_console' is not set at /usr/lib/os-autoinst/backend/baseclass.pm line 732.
        backend::baseclass::wait_serial(backend::ipmi=HASH(0x55e734cde558), HASH(0x55e7334b9160)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 75                                                                               
        backend::baseclass::handle_command(backend::ipmi=HASH(0x55e734cde558), HASH(0x55e733fb2838)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 487                                                                           
        backend::baseclass::check_socket(backend::ipmi=HASH(0x55e734cde558), IO::Handle=GLOB(0x55e733522470), 0) called at /usr/lib/os-autoinst/backend/ipmi.pm line 131                                                                    
        backend::ipmi::check_socket(backend::ipmi=HASH(0x55e734cde558), IO::Handle=GLOB(0x55e733522470), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 246                                                                    
        eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
        backend::baseclass::run_capture_loop(backend::ipmi=HASH(0x55e734cde558)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 129                                                                                               
        backend::baseclass::run(backend::ipmi=HASH(0x55e734cde558), 5, 8) called at /usr/lib/os-autoinst/backend/driver.pm line 85                                                                                                          
        backend::driver::start(backend::driver=HASH(0x55e733fc6310)) called at /usr/lib/os-autoinst/backend/driver.pm line 48                                                                                                               
        backend::driver::new("backend::driver", "ipmi") called at /usr/bin/isotovideo line 236
        main::init_backend() called at /usr/bin/isotovideo line 305

 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
        backend::baseclass::die_handler("'current_console' is not set at /usr/lib/os-autoinst/backend/"...) called at /usr/lib/perl5/5.26.1/Carp.pm line 168                                                                                
        Carp::confess("'current_console' is not set") called at /usr/lib/os-autoinst/backend/baseclass.pm line 732
        backend::baseclass::wait_serial(backend::ipmi=HASH(0x55e734cde558), HASH(0x55e7334b9160)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 75                                                                               
        backend::baseclass::handle_command(backend::ipmi=HASH(0x55e734cde558), HASH(0x55e733fb2838)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 487                                                                           
        backend::baseclass::check_socket(backend::ipmi=HASH(0x55e734cde558), IO::Handle=GLOB(0x55e733522470), 0) called at /usr/lib/os-autoinst/backend/ipmi.pm line 131
        backend::ipmi::check_socket(backend::ipmi=HASH(0x55e734cde558), IO::Handle=GLOB(0x55e733522470), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 246
        eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
        backend::baseclass::run_capture_loop(backend::ipmi=HASH(0x55e734cde558)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 129
        backend::baseclass::run(backend::ipmi=HASH(0x55e734cde558), 5, 8) called at /usr/lib/os-autoinst/backend/driver.pm line 85
        backend::driver::start(backend::driver=HASH(0x55e733fc6310)) called at /usr/lib/os-autoinst/backend/driver.pm line 48
        backend::driver::new("backend::driver", "ipmi") called at /usr/bin/isotovideo line 236
        main::init_backend() called at /usr/bin/isotovideo line 305

Example: http://slindomansilla-vm.qa.suse.de/tests/155
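
When comparing many incomplete jobs, it can be handy to pull the DIE message and the call chain out of a log excerpt like the one above automatically. The following is a rough sketch that assumes the log lines look exactly like the excerpt in this comment. The regular expressions and the parse_die helper are hypothetical and not part of os-autoinst.

```python
import re

# Matches e.g. "DIE <message> at <file> line <number>."
DIE_RE = re.compile(
    r"^DIE (?P<message>.*?) at (?P<file>\S+) line (?P<line>\d+)\.$")
# Matches indented "pkg::func(args) called at <file> line <number>" frames.
# "eval {...}" frames are intentionally not matched and thus skipped.
FRAME_RE = re.compile(
    r"^\s+(?P<call>[\w:]+)\(.*\) called at (?P<file>\S+) line (?P<line>\d+)\s*$")


def parse_die(log_text):
    """Return the first DIE message and its stack frames, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        m = DIE_RE.match(line.strip())
        if not m:
            continue
        frames = []
        for frame_line in lines[i + 1:]:
            if not frame_line.strip():
                break  # the first blank line ends the first trace
            f = FRAME_RE.match(frame_line)
            if f:
                frames.append((f["call"], f["file"], int(f["line"])))
        return {"message": m["message"], "file": m["file"],
                "line": int(m["line"]), "frames": frames}
    return None


if __name__ == "__main__":
    excerpt = (
        "DIE 'current_console' is not set at "
        "/usr/lib/os-autoinst/backend/baseclass.pm line 732.\n"
        "        main::init_backend() called at /usr/bin/isotovideo line 305\n"
    )
    print(parse_die(excerpt))
```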

#21 Updated by okurz over 1 year ago

That sysrq-call was added recently; dheidler was involved. I think I talked with him about excluding IPMI, but so far only hyper-v is excluded, I don't know why. However, that sysrq-call is only ever triggered in a post_fail_hook, so it cannot be your original problem, and it is not: http://slindomansilla-vm.qa.suse.de/tests/155/file/autoinst-log.txt clearly shows that the isosize test module fails. It's interesting, though, that https://openqa.suse.de/tests/2163265/file/autoinst-log.txt shows that the IPMI job also downloads and checks the ISO, which obviously does not make much sense for IPMI. Maybe you want to pick up #38807 again? ;)

#22 Updated by SLindoMansilla over 1 year ago

The ticket #38807 is like a black hole. We would end the sprint with neither of the two issues resolved.
If possible, I may find a workaround for this to have at least this IPMI machine working on OSD.

#23 Updated by SLindoMansilla over 1 year ago

  • Status changed from In Progress to Feedback

#24 Updated by SLindoMansilla over 1 year ago

  • Status changed from Feedback to Blocked

Let's investigate BOOT_FROM_PXE before giving the machine back to OSD: #36027

#25 Updated by SLindoMansilla over 1 year ago

  • Blocked by action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all added

#26 Updated by coolo over 1 year ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Infrastructure)

#27 Updated by okurz over 1 year ago

  • Subject changed from [sle][functional][u][ipmi] Malfunction of openqaworker2:25 - Investigate, bring it back or decommission it (WAS: remove openqaworker2:25 (IPMI machine) from OSD testing) to [sle][functional][u][ipmi] Malfunction of openqaworker2:25 - Investigate, bring it back or repair it (WAS: remove openqaworker2:25 (IPMI machine) from OSD testing)

I brought back fozzie into production with https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/129 as asked by xlai. We have not yet ensured proper stability of this machine within this ticket but it seems that we have generic problems for all IPMI machines so keeping the machine disabled for production is also not the best option.

#28 Updated by okurz about 1 year ago

  • Due date deleted (23/10/2018)
  • Target version changed from Milestone 20 to Milestone 22

adjusting current estimate based on the blocking ticket which is currently not being worked on.

#29 Updated by okurz about 1 year ago

  • Target version changed from Milestone 22 to Milestone 24

#30 Updated by mgriessmeier 9 months ago

  • Target version changed from Milestone 24 to Milestone 25

#31 Updated by mgriessmeier 8 months ago

  • Target version changed from Milestone 25 to Milestone 26

#32 Updated by mgriessmeier 6 months ago

  • Target version changed from Milestone 26 to Milestone 27

still valid? should it be moved over to nick or tools team?

#33 Updated by coolo 6 months ago

Hardware issues were sorted out months ago. The next stop is still the first_boot experiments.

#34 Updated by SLindoMansilla 6 months ago

  • Blocked by deleted (action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all)

#35 Updated by SLindoMansilla 6 months ago

  • Status changed from Blocked to Resolved

So, let's resolve this in favor of #38423

#36 Updated by SLindoMansilla 6 months ago

  • Related to action #36027: [sle][functional][u][ipmi] test fails in boot_from_pxe - pxe boot menu doesn't show up at all added
