action #42026

closed

Add NVDIMM Bare Metal Server via IPMI to osd pool

Added by acarvajal over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2018-10-05
Due date: -
% Done: 100%
Estimated time: -

Description

Dell server holmes.qa.suse.de needs to be added to openqa.suse.de as a bare metal worker.

Here are the details:

FQDN: holmes.qa.suse.de
BMC: sp.holmes.qa.suse.de
Racktables: https://racktables.nue.suse.com/index.php?page=object&tab=edit&object_id=10699
QANet settings: https://gitlab.suse.de/qa-sle/qanet-configs/commit/6ac7bf475221f047a6d691acd9eda70f78c7e69d

Actions #1

Updated by acarvajal over 5 years ago

IPMI configured on server holmes, via sp.holmes.qa.suse.de.

Verified with:

ipmitool -I lanplus -H sp.holmes.qa.suse.de -U openqa -P ******** sdr elist all

Password set to the openqa default password (the same as worker 3 on openqaworker13).
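Besides sdr elist, a couple of other ipmitool subcommands are handy when validating a new BMC (SOL configuration, chassis power state). The snippet below is a sketch that only prints the command lines instead of contacting sp.holmes.qa.suse.de, with the password elided on purpose:

```shell
# Sketch: print (not run) a few ipmitool checks useful when validating a new
# BMC; the real commands would be run against the BMC with the worker password.
BMC=sp.holmes.qa.suse.de
IPMI_USER=openqa
for cmd in "sdr elist all" "sol info" "chassis power status"; do
    echo "ipmitool -I lanplus -H $BMC -U $IPMI_USER -P <password> $cmd"
done
```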

Actions #2

Updated by acarvajal over 5 years ago

Actions #3

Updated by coolo over 5 years ago

  • Project changed from openQA Project to openQA Infrastructure
Actions #4

Updated by acarvajal over 5 years ago

Bare metal NVDIMM server added to openqa.suse.de via the grenache-1:11 worker. Tests there can be triggered with WORKER_CLASS=64bit-ipmi-nvdimm; the IPMI connection is working and VNC starts successfully. However, the worker currently fails to read the serial console output correctly, which causes tests to fail before connecting via VNC, so this is still a work in progress.

To use, set: MACHINE=64bit-ipmi-nvdimm, BACKEND=ipmi, WORKER_CLASS=64bit-ipmi-nvdimm and SUT_NETDEVICE=em1.
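As an illustration, those variables could be passed when posting a job with the openQA client; the exact invocation below is an assumption (tool name and syntax are not taken from this ticket), so the snippet only echoes the command instead of actually posting anything:

```shell
# Hypothetical invocation (client syntax assumed, not from the ticket);
# echoed rather than executed so no job is actually posted.
echo openqa-client jobs post \
    MACHINE=64bit-ipmi-nvdimm BACKEND=ipmi \
    WORKER_CLASS=64bit-ipmi-nvdimm SUT_NETDEVICE=em1
```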

Actions #5

Updated by acarvajal over 5 years ago

Currently I'm experiencing two problems with the setup of the bare metal NVDIMM server (Dell PowerEdge R640) on openqa.suse.de (grenache-1, worker 11):

1) The test installation/disable_grub_timeout, which sets a bootloader timeout of -1 and is loaded whenever VIRSH_VMM_TYPE is not 'linux', causes the installed system to fail to boot. I tested manually on the server with the same ISO image: with the default timeout of 8 seconds the system boots successfully after installation, while changing the timeout to -1 (as the openQA test does) prevents it from booting. I considered adding a second condition for loading this test in lib/main_common.pm in os-autoinst-distri-opensuse, but I'm not sure an "unless (check_var('VIRSH_VMM_TYPE', 'linux') or check_var('WORKER_CLASS', '64bit-ipmi-nvdimm'))" is the proper way to go. Instead I'm thinking of adding a KEEP_GRUB_TIMEOUT variable to control this, and setting that variable in the machine definition in osd. Any thoughts?
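To make the effect of problem 1 concrete: the sketch below simulates, on a scratch copy of /etc/default/grub, the timeout change the test effectively performs (the real test acts through the installer, not via sed; file path and contents here are illustration only). A timeout of -1 makes GRUB wait indefinitely at the menu, which on a headless bare metal server means boot never proceeds:

```shell
# Illustration only: simulate the bootloader timeout change on a scratch file;
# the real disable_grub_timeout test changes this through the installer UI.
cat > /tmp/grub.default <<'EOF'
GRUB_TIMEOUT=8
GRUB_DEFAULT=0
EOF

# -1 means "wait forever for a key press" -- fatal on an unattended server
sed -i 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=-1/' /tmp/grub.default

grep '^GRUB_TIMEOUT=' /tmp/grub.default   # prints: GRUB_TIMEOUT=-1
```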

2) The connection to the SOL terminal does not seem to work after installation: https://openqa.suse.de/tests/2211601#step/qa_net_boot_from_hdd/3

During the startup of the test, the SOL terminal can be successfully accessed, and the needles related to QANET PXE boot are properly detected: https://openqa.suse.de/tests/2211601#step/boot_from_pxe/1

However, once the installation completes and the server reboots, the SOL terminal is selected but shows no output, and the test fails to detect the QANET PXE boot menu.

Per the log, there seems to be an issue when selecting the SOL console after reboot:

[2018-10-25T15:09:08.0779 CEST] [debug] /var/lib/openqa/cache/tests/sle/tests/boot/qa_net_boot_from_hdd.pm:18 called testapi::select_console
[2018-10-25T15:09:08.0779 CEST] [debug] <<< testapi::select_console(testapi_console='sol', await_console=0)
/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
'hostname' => 'localhost',
'port' => 47389,
'ikvm' => 0
}
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34686"
after 15738 requests (15581 known processed) with 0 events remaining.
[2018-10-25T15:09:08.0798 CEST] [debug] Driver backend collected unknown process with pid 71764 and exit status: 1
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":34686"
[2018-10-25T15:09:08.0802 CEST] [debug] Driver backend collected unknown process with pid 71766 and exit status: 84
[2018-10-25T15:09:09.0024 CEST] [debug] Connected to Xvnc - PID 81408
icewm PID is 81439
xterm PID is 81441
IceWM: using /var/lib/empty/.icewm for private configuration files
[2018-10-25T15:09:10.0112 CEST] [debug] activate_console, console: sol, type:
[2018-10-25T15:09:10.0112 CEST] [debug] activate_console called with generic type, no action

The port 34686 was the one used when setting up the SOL console at the start of the test:

[2018-10-25T14:31:49.0724 CEST] [debug] /var/lib/openqa/cache/tests/sle/tests/boot/boot_from_pxe.pm:31 called testapi::select_console
[2018-10-25T14:31:49.0724 CEST] [debug] <<< testapi::select_console(testapi_console='sol', await_console=0)
/usr/lib/os-autoinst/consoles/vnc_base.pm:71:{
'port' => 34686,
'hostname' => 'localhost',
'ikvm' => 0
}
[2018-10-25T14:31:49.0967 CEST] [debug] Connected to Xvnc - PID 71739
icewm PID is 71764
xterm PID is 71766
IceWM: using /var/lib/empty/.icewm for private configuration files
[2018-10-25T14:31:51.0056 CEST] [debug] activate_console, console: sol, type:
[2018-10-25T14:31:51.0057 CEST] [debug] activate_console called with generic type, no action

However, the same kind of trace can be seen in the log of https://openqa.suse.de/tests/2203411#step/qa_net_boot_from_hdd/1 (which runs on a different server over IPMI), where the SOL terminal does work after installation.

Also tested plain SLES (12-SP4, Build 0443) from openqa.suse.de, with the same results as with the SLES4SAP ISO.

Actions #7

Updated by acarvajal over 5 years ago

Machine defined in openqa.suse.de as 64bit-ipmi-nvdimm with the ipmi backend and these settings:

IPMI_HW=dell
KEEP_GRUB_TIMEOUT=1
SERIALDEV=ttyS0
SUT_NETDEVICE=em1
TIMEOUT_SCALE=3
VNC_TYPING_LIMIT=5
WORKER_CLASS=64bit-ipmi-nvdimm
_CHKSEL_RATE_WAIT_TIME=120

Currently osd can install a working system on the bare metal server, but the SOL connection over IPMI stops working once the installation completes and the system reboots. Nothing is shown on SOL, and the boot/qa_net_boot_from_hdd test fails on this system.

Actions #8

Updated by acarvajal over 5 years ago

  • % Done changed from 0 to 80
Actions #9

Updated by acarvajal over 5 years ago

  • % Done changed from 80 to 90

Tested on a development server with https://github.com/os-autoinst/os-autoinst/pull/1021 applied, and the test now passes boot/qa_net_boot_from_hdd. Will wait for that PR to be deployed and test again on openqa.suse.de.

Actions #10

Updated by nicksinger over 5 years ago

  • Status changed from New to In Progress

@acarvajal given your latest comment I assume this ticket is at least "In Progress", if not already in "Feedback" or even "Resolved". Therefore I'm taking the liberty of changing the state. Feel free to adjust it to whatever state fits best for you :)

Actions #11

Updated by acarvajal over 5 years ago

Handle pmem devices in installation/partitioning_firstdisk:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6252

Actions #12

Updated by acarvajal over 5 years ago

Further changes required in os-autoinst-distri-opensuse to use this server: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6282

Once these changes are merged, I will trigger a test on the server from openqa.suse.de. If that is successful, will finally close this ticket.

Actions #13

Updated by acarvajal over 5 years ago

Changes were merged into os-autoinst-distri-opensuse.

Job triggered in the Dell server: https://openqa.suse.de/tests/2286941

However, the backend is crashing during first_boot:

Nov 28 18:41:15 grenache-1 systemd-coredump[565126]: Core Dumping has been disabled for process 553478 (/usr/bin/isotov).
Nov 28 18:41:15 grenache-1 systemd-coredump[565126]: Process 553478 (/usr/bin/isotov) of user 480 dumped core.

grenache-1:~ # coredumpctl -1 info
PID: 553478 (/usr/bin/isotov)
UID: 480 (_openqa-worker)
GID: 65534 (nogroup)
Signal: 11 (SEGV)
Timestamp: Wed 2018-11-28 18:41:15 CET (8min ago)
Command Line: /usr/bin/isotovideo: backen
Executable: /usr/bin/perl
Control Group: /
Slice: -.slice
Boot ID: e7e27934bd5a43769cdf27fe0f775ce0
Machine ID: 3a6ed299dfcc384338b497455a69c107
Hostname: grenache-1
Message: Process 553478 (/usr/bin/isotov) of user 480 dumped core.

I saw similar errors on my development station, but only sporadically. The last working job there was: http://mango.suse.de/tests/680

Actions #14

Updated by zluo over 5 years ago

Actions #15

Updated by acarvajal about 5 years ago

After some minor fixes to the partitioning_firstdisk test in the os-autoinst-distri-opensuse repo (https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6749), I tested this server again over IPMI on osd, with a successful test:

https://openqa.suse.de/tests/2453860 (failure in sles4sap/wizard_hana_install was expected)

(osd was deployed with new versions on January 23rd)

However, all the re-triggered tests so far have aborted with backend crashes:

https://openqa.suse.de/tests/2453874
https://openqa.suse.de/tests/2453911
https://openqa.suse.de/tests/2454331
https://openqa.suse.de/tests/2454859
https://openqa.suse.de/tests/2455039

Specifically, the backend crashes when the installation/first_boot test issues select_console('x11'):

[2019-02-12T17:21:27.370 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/first_boot.pm:32 called testapi::select_console
[2019-02-12T17:21:27.370 CET] [debug] <<< testapi::assert_screen(mustmatch=[
'displaymanager',
'displaymanager-password-prompt',
'generic-desktop',
'screenlock',
'screenlock-password'
], timeout=30, no_wait=1)
[2019-02-12T17:21:30.173 CET] [debug] Driver backend collected unknown process with pid 253978 and exit status: 0
can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.

grenache-1:~ # coredumpctl | tail
Thu 2019-01-31 00:00:30 CET 619146 480 65534 11 /usr/bin/perl
Fri 2019-02-01 09:06:48 CET 144993 480 65534 11 /usr/bin/perl
Fri 2019-02-01 18:34:37 CET 284414 480 65534 11 /usr/bin/perl
Sat 2019-02-02 07:50:33 CET 392460 480 65534 11 /usr/bin/perl
Mon 2019-02-11 18:17:01 CET 73497 480 65534 11 /usr/bin/perl
Tue 2019-02-12 12:49:13 CET 208593 480 65534 11 /usr/bin/perl
Tue 2019-02-12 14:01:51 CET 218329 480 65534 11 /usr/bin/perl
Tue 2019-02-12 14:59:54 CET 226840 480 65534 11 /usr/bin/perl
Tue 2019-02-12 16:14:50 CET 238119 480 65534 11 /usr/bin/perl
Tue 2019-02-12 17:21:32 CET 247356 480 65534 11 /usr/bin/perl

Actions #16

Updated by acarvajal about 5 years ago

New test today:

https://openqa.suse.de/tests/2456313

No backend crash.

Actions #17

Updated by acarvajal about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Test is working now:

https://openqa.suse.de/tests/2471625

There are still some instances of the test not completing due to crashes in the backend, but it's certainly looking better:

https://openqa.suse.de/tests/2471625#next_previous

Last 5 attempts have shown no issues on the backend.

Will (finally) close this ticket.
