action #139271

closed

Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: -
Target version: Ready
Start date: 2023-09-20
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In the FC Basement we currently have unused PowerPC hardware that may be put to good use, namely mania, a Power8 machine.

Acceptance criteria

  • AC1: DONE mania is actively used
  • AC2: DONE production openQA OSD jobs are able to execute successfully on mania
  • AC3: DONE Our inventory management system is up-to-date regarding the use and configuration of mania
  • AC4: DONE Workerconf has the IPMI command to access the machine

Suggestions

  • DONE Offer the machines for investigation and use in coordination with hsehic, the former coordinator for those machines
  • DONE Ensure the machine is used for something useful, e.g. openQA OSD worker

Out of scope

ppc64le multi-machine jobs -> #136130

Further details

Potentially interesting and related: https://bugzilla.suse.com/show_bug.cgi?id=1041747, never fixed, workarounds mentioned like QEMUCPUS=1; QEMUTHREADS=1 which likely should not be needed anymore on recent OS kernels.


Related issues: 13 (2 open, 11 closed)

Related to openQA Project - coordination #139010: [epic] Long OSD ppc64le job queue (New, 2023-11-04)
Related to openQA Tests - action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M (Resolved, mkittler, 2023-09-20)
Related to openQA Tests - action #60257: [functional][u][mistyping] random typing issues on wayland (Rejected, mgriessmeier, 2019-11-25)
Related to openQA Tests - action #54914: [Leap-15.4][qe-core][functional] Mistyping in some modules: vlc, x_vt (formerly xorg_vt), etc. (New, 2019-07-31)
Related to openQA Tests - action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) (Resolved, SLindoMansilla, 2019-01-15)
Related to openQA Tests - action #42362: [qe-core][qem][virtio][sle15sp0][sle15sp1][desktop][typing] test fails in window_system because "typing string is too fast in wayland" (Resolved, zcjia, 2018-10-12)
Related to openQA Infrastructure - action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) (Resolved, okurz, 2018-10-05)
Related to openQA Tests - action #40670: [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters (Resolved, SLindoMansilla, 2018-09-06)
Related to openQA Tests - action #26910: [ppc64le][tools][BLOCKED][bsc#1041747] PowerPC looses keys (Resolved, okurz, 2017-10-20, 2018-03-27)
Related to openQA Tests - action #68218: [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot (Resolved, 2020-06-18)
Related to openQA Infrastructure - action #30595: [ppc64le] Deploy bcache on tmpfs workers (Closed, 2018-01-22)
Copied from openQA Infrastructure - action #136121: Repurpose PowerPC hardware in FC Basement - deagol (Resolved, okurz, 2023-09-20)
Copied to openQA Infrastructure - action #150983: CPU Load and usage alert for openQA workers size:S (Resolved, okurz)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #136121: Repurpose PowerPC hardware in FC Basement - deagol added
Actions #2

Updated by okurz 6 months ago

Actions #3

Updated by okurz 6 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

Connected power cables and system ethernet and updated the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9588 . Logged into https://fsp1-mania.qe.nue2.suse.org/cgi-bin/cgi with admin/admin. IPMI access also works with "ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin" without a username. In the ASM menu "System Configuration > Firmware Configuration" I set the firmware type to OPAL, powered on over IPMI and connected SoL. Eventually petitboot came up. From there I could go to a shell and confirm that we have network on the device "enP5p1s0f0" with L2 address 40:f2:e9:31:54:60, so I added that to racktables. Following #119008-6 I can come up with a netboot of the installer:

wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/linux
wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/initrd
kexec -l linux --initrd initrd --command-line 'install=http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/ autoyast=https://is.gd/oqaay5 sshd=1 sshpassword=linux rootpassword=linux console=tty0 console=hvc0'
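
For reference, the IPMI power-on and serial-over-LAN steps mentioned above would look roughly like this (a sketch using standard ipmitool subcommands, same credentials as above):

# power the machine on via the FSP/BMC
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin chassis power on
# attach to the serial-over-LAN console to watch petitboot come up
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin sol activate

Note that "kexec -l" only loads the kernel and initrd; a subsequent "kexec -e" then actually boots into the installer.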

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 for an update of DHCP/DNS entries.

And in the installed system I installed the old kernel according to #116078-81 and added a package lock with

for i in kernel-default util-linux "kernel*"; do zypper al --comment "poo#119008, kernel regression boo#1202138" $i; done
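
In case those locks need to be reviewed or lifted again later, standard zypper lock handling should do:

zypper ll                                        # list the currently active package locks
zypper rl kernel-default util-linux "kernel*"    # remove the locks again once the regression is fixed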
Actions #4

Updated by openqa_review 6 months ago

  • Due date set to 2023-11-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 6 months ago

I had to tweak the network settings during installation as the installer always insisted on using the unconnected eth1, but eventually I ended up with a properly SSH-reachable installed system, although I ended up installing Leap 15.4. I can try to reinstall or upgrade to Leap 15.5; the old default root password is set.

Actions #6

Updated by okurz 6 months ago

  • Due date changed from 2023-11-25 to 2023-12-15
  • Status changed from In Progress to Blocked

I checked storage performance and sda, sdb, sdc, sdd, sde, sdf and sdg all seem to have the same performance. So I would say it's ok to stay with the root partition on sda, potentially change to RAID, and use something like sdf+sdg, both 1.1 TB, for /var/lib/openqa as cache and pool storage.
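
A quick way to compare raw read throughput across those disks would be something like the following sketch (not necessarily how the check above was done):

for d in /dev/sd{a..g}; do hdparm -t "$d"; done

The RAID0 array and mount were then set up with: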

mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=2 --run /dev/sd{f,g}
/sbin/mkfs.ext2 -F /dev/md/openqa
mkdir -p /var/lib/openqa
echo '/dev/md/openqa /var/lib/openqa ext2 defaults 0 0' >> /etc/fstab
mount -a
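
A quick sanity check of the resulting array and mount (a sketch; persisting the array in mdadm.conf is optional here since the filesystem content is recreatable):

cat /proc/mdstat                 # the raid0 array should show up as active
mdadm --detail /dev/md/openqa    # level and member disks
df -h /var/lib/openqa            # roughly 2.2 TB usable from the two 1.1 TB disks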

Now waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 because I would like to prevent problems from the lack of a proper static FQDN.

Actions #7

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #8

Updated by okurz 6 months ago

  • Status changed from Blocked to In Progress

The MR was merged and deployed, thanks to gpfuetzenreuther. "wicked ifup all" on mania made it use the new IP and the machine is reachable under the FQDN. Rebooting to make that fully effective and then adding the machine to salt.

Actions #9

Updated by okurz 6 months ago

  • Related to action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Actions #10

Updated by okurz 6 months ago

  • Subject changed from Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC to Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M
  • Description updated (diff)
Actions #11

Updated by okurz 6 months ago

Added to salt, applied the high state multiple times and triggered verification jobs:

openqa-clone-job --repeat 20 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?groupid=110&version=15-SP6&distri=sle&build=poo139271-okurz

One erroneous key-repetition error in https://openqa.suse.de/tests/12811244#step/installation_snapshots/4 makes it 1/20 failed, i.e. 5.0%, IMHO too much. Let's see if we can reproduce that more easily with more instances. On mania I did

systemctl enable --now openqa-worker-auto-restart@{11..30}

and am now triggering 200 jobs for a bigger sample with

openqa-clone-job --repeat 200 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2 _GROUP=0 WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2

That shows 12 failed vs. 158 passed, i.e. about a 7% failure rate, which is rather high, but no significant new problem due to the increase in the number of instances.

Asking in https://suse.slack.com/archives/C02CANHLANP/p1700035576832419 in case somebody has an idea:

Hi, I am in the process of adding another Power8 machine for qemu-ppc64le kvm testing to OSD. As part of that I ran some openQA test jobs, e.g. see https://openqa.suse.de/tests/overview?build=poo139271-okurz_2 . There are some test failures, 3 in postgresql_server due to timeout on service startup, 7 due to key repetition in installation_snapshots. Do those failures ring a bell for anyone? Does anyone remember some special PowerPC kernel flag that would prevent that? If we can't solve that problem then I wouldn't add the machine to production worker classes hence next kernel live-patch testing might be delayed again significantly

And for reference calling:

for i in diesel petrol; do openqa-clone-job --repeat 40 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2-reference_$i _GROUP=0 WORKER_CLASS=$i; done

on diesel+petrol.

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol

My hypothesis for https://openqa.suse.de/tests/overview?modules=installation_snapshots&modules_result=failed&build=poo139271-okurz_2 is that the rotating disk storage on mania causes qemu snapshot creation to leave the virtual machines partially unresponsive or sluggish, which causes the mistyping. https://openqa.suse.de/tests/12811789/logfile?filename=autoinst-log.txt shows

[2023-11-15T07:27:51.425927Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found cleared-console-root-20200612, similarity 1.00 @ 97/3
[2023-11-15T07:27:51.431856Z] [debug] [pid:50570] ||| finished hostname console (runtime: 73 s)
[2023-11-15T07:27:51.433277Z] [debug] [pid:50570] Creating a VM snapshot lastgood
…
[2023-11-15T07:28:52.597378Z] [debug] [pid:50623] Snapshot complete
[2023-11-15T07:29:06.015723Z] [debug] [pid:50623] EVENT {"event":"RESUME","timestamp":{"microseconds":15462,"seconds":1700033346}}
[2023-11-15T07:29:06.020447Z] [debug] [pid:50570] ||| starting installation_snapshots tests/console/installation_snapshots.pm
[2023-11-15T07:29:06.022067Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/hd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=21474836480 backing_file=/var/lib/openqa/pool/2/raid/hd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022158Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/cd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=541298688 backing_file=/var/lib/openqa/pool/2/raid/cd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022875Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console
[2023-11-15T07:29:06.023013Z] [debug] [pid:50570] <<< testapi::select_console(testapi_console="root-console")
[2023-11-15T07:29:06.426979Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console -> lib/susedistribution.pm:986 called testapi::assert_screen
[2023-11-15T07:29:06.427226Z] [debug] [pid:50570] <<< testapi::assert_screen(mustmatch="root-console", no_wait=1, timeout=30)
[2023-11-15T07:29:06.657636Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found root-console-top-20200304, similarity 1.00 @ 96/2
[2023-11-15T07:29:06.657988Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658190Z] [debug] [pid:50570] <<< testapi::assert_script_run(cmd="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", quiet=undef, timeout=90, fail_message="")
[2023-11-15T07:29:06.658346Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658532Z] [debug] [pid:50570] <<< testapi::type_string(string="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:09.742616Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:09.742951Z] [debug] [pid:50570] <<< testapi::type_string(string="; echo yv1NU-\$?- > /dev/hvc0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:10.806876Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:10.807177Z] [debug] [pid:50570] <<< testapi::wait_serial(timeout=90, expect_not_found=0, no_regex=0, buffer_size=undef, regexp=qr/yv1NU-\d+-/u, quiet=undef, record_output=undef)
[2023-11-15T07:29:12.910833Z] [debug] [pid:50570] >>> testapi::wait_serial: (?^u:yv1NU-\d+-): ok
[2023-11-15T07:29:13.063412Z] [info] [pid:50570] ::: basetest::runtest: # Test died: command 'snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'' failed at /usr/lib/os-autoinst/testapi.pm line 926.

So the snapshot is created, the VM is resumed, and then the command "snapper list" is mistyped shortly afterwards. As an experiment I am triggering tests with disabled snapshots to check this hypothesis:

openqa-clone-job --repeat 80 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_3_disable_snapshots _GROUP=0 WORKER_CLASS=mania QEMU_DISABLE_SNAPSHOTS=1

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots

Let's see whether the reference builds or this one show a change.

And for the IPMI credentials:

ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user 2 set name ADMIN
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set password 2 …
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel setaccess 1 2 link=on ipmi=on callin=on privilege=4
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user enable 2
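
To double-check the new user and its channel privileges, something like this should work:

ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user list 1           # user 2 should show up as ADMIN and enabled
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel getaccess 1 2 # privilege level ADMINISTRATOR on channel 1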

And I rebooted due to some failures like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1980642#L1136

Failed to set net.ipv4.conf.br1.forwarding to 1: sysctl net.ipv4.conf.br1.forwarding does not exist

Fixed after reboot.
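
The per-interface sysctl tree only appears once br1 exists, so a quick check before re-applying salt would be a sketch like:

ip link show br1                          # the bridge has to exist first
sysctl net.ipv4.conf.br1.forwarding       # the key is only present once br1 exists
sysctl -w net.ipv4.conf.br1.forwarding=1  # then setting it manually should succeed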

If problems regarding mistyping persist then maybe we can also tweak some of the parameters listed by

sudo salt -C 'G@osarch:ppc64le' cmd.run 'for i in /sys/module/kvm*/parameters/*; do echo "$i: $(cat $i)"; done'
petrol.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
diesel.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
mania.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0

e.g. "one_vm_per_core"

Actions #12

Updated by okurz 6 months ago

  • Related to action #60257: [functional][u][mistyping] random typing issues on wayland added
Actions #13

Updated by okurz 6 months ago

  • Related to action #54914: [Leap-15.4][qe-core][functional] Mistyping in some modules: vlc, x_vt (formerly xorg_vt), etc. added
Actions #14

Updated by okurz 6 months ago

  • Related to action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) added
Actions #15

Updated by okurz 6 months ago

  • Related to action #42362: [qe-core][qem][virtio][sle15sp0][sle15sp1][desktop][typing] test fails in window_system because "typing string is too fast in wayland" added
Actions #16

Updated by okurz 6 months ago

  • Related to action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) added
Actions #17

Updated by okurz 6 months ago

  • Related to action #40670: [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters added
Actions #18

Updated by okurz 6 months ago

  • Related to action #26910: [ppc64le][tools][BLOCKED][bsc#1041747] PowerPC looses keys added
Actions #19

Updated by okurz 6 months ago

  • Related to action #68218: [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot added
Actions #20

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #21

Updated by okurz 6 months ago

  • Tags changed from infra, fc-basement, PowerPC to infra, fc-basement, PowerPC, next-frankencampus-visit

https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1699983668665&to=1700055939342&viewPanel=56720 supports the hypothesis that snapshotting in tests is causing problems: over the first hours, openQA tests with qemu snapshots enabled were running and caused the disk I/O time to max out at 1 s; later, without snapshots, the situation is much better. I can look into fitting SSDs in the FC Basement.

Actions #22

Updated by okurz 6 months ago

  • Related to action #30595: [ppc64le] Deploy bcache on tmpfs workers added
Actions #23

Updated by okurz 6 months ago

All builds https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel, https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol and https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots show a 100% pass ratio, so maybe I can fit more performant SSDs tomorrow. If that is not possible or does not help then we should look for other suitable storage and/or keep qemu snapshots disabled. Or a tmpfs? But then how do we ensure the directory structure is always created? Also see #30595 or #106841.
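
If we go the tmpfs route, one way to recreate the directory structure at boot could be a small sketch like this (the worker user name and the instance count are assumptions):

mount -t tmpfs -o size=512G tmpfs /var/lib/openqa/pool
for i in $(seq 1 30); do install -d -o _openqa-worker "/var/lib/openqa/pool/$i"; done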

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/678, merged, deploying.

Actions #24

Updated by okurz 5 months ago

Monitoring https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-24h&to=now . mania is very stable. There were periods of stalling I/O and I am still considering using a tmpfs, but not for all of /var/lib/openqa, only for /var/lib/openqa/pool, so that I don't have a problem with a missing directory structure.

I did

systemctl stop openqa-worker-auto-restart@{1..30}
rm -rf /var/lib/openqa/pool/*
mount -t tmpfs -o size=512G tmpfs /var/lib/openqa/pool
systemctl start openqa-worker-auto-restart@{1..50}

so 50 (!) instances to stress the system. Then I triggered tests with

openqa-clone-job --repeat 100 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_4_50_instances_pool_on_tmpfs _GROUP=0 WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_4_50_instances_pool_on_tmpfs with 8/100 failures, and https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1700161086904&to=1700207814794 showed the system to be clearly overwhelmed. So I stopped instances 31..50 again.

In the failures I see symptoms of lost or stuck keys, so it is good to know what the symptoms of an overloaded system look like.

I wonder if we can share mania for manual testing as well as using it as an openQA worker. Maybe if we limit the resources that libvirtd can take we can ensure that openQA operation remains unharmed.

Actions #25

Updated by okurz 5 months ago

  • Copied to action #150983: CPU Load and usage alert for openQA workers size:S added
Actions #27

Updated by okurz 5 months ago

  • Description updated (diff)

I added to /etc/fstab

/dev/md/openqa /var/lib/openqa                          ext2   defaults              0  0
none /var/lib/openqa/pool tmpfs size=512G 0 0
openqa.suse.de:/var/lib/openqa/share        /var/lib/openqa/share   nfs ro,noauto,nofail,retry=30,x-systemd.mount-timeout=30m,x-systemd.device-timeout=10m,x-systemd.automount  0 0
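
After editing fstab a quick check that everything resolves as intended (a sketch):

mount -a
findmnt /var/lib/openqa/pool                # should show the 512G tmpfs
df -h /var/lib/openqa /var/lib/openqa/pool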

Right now a lot of jobs are running. Monitoring that, then rebooting and putting the machine back into salt when the job queue has reduced.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/679 to enable qemu snapshots again now that the tmpfs is in place.

I am currently trying to add corresponding alerts for CPU load.

Actions #28

Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback
Actions #29

Updated by okurz 5 months ago

  • Tags changed from infra, fc-basement, PowerPC, next-frankencampus-visit to infra, fc-basement, PowerPC
  • Description updated (diff)

Merged and deployed. Monitoring for the time being, in particular waiting for the next automatically triggered reboot.

Actions #30

Updated by okurz 5 months ago

  • Due date deleted (2023-12-15)
  • Status changed from Feedback to Resolved

mania is up after the latest automatic reboot and all jobs that I could see are very stable. All good now. Let's stay with 30 instances.
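
For reference, staying with 30 instances means keeping the extra worker units from the earlier stress test disabled, roughly:

systemctl disable --now openqa-worker-auto-restart@{31..50}
systemctl enable --now openqa-worker-auto-restart@{1..30}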
