action #139271

closed

Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: -
Target version: Ready
Start date: 2023-09-20
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In the FC Basement we currently have unused PowerPC hardware that may be put to good use, namely mania, a Power8 machine.

Acceptance criteria

  • AC1: DONE mania is actively used
  • AC2: DONE production openQA OSD jobs are able to execute successfully on mania
  • AC3: DONE Our inventory management system is up-to-date regarding the use and configuration of mania
  • AC4: DONE Workerconf has the IPMI command to access the machine

Suggestions

  • DONE Offer the machines for investigation and use in coordination with hsehic, the former coordinator for those machines
  • DONE Ensure the machine is used for something useful, e.g. openQA OSD worker

Out of scope

ppc64le multi-machine jobs -> #136130

Further details

Potentially interesting and related: https://bugzilla.suse.com/show_bug.cgi?id=1041747, never fixed, workarounds mentioned like QEMUCPUS=1; QEMUTHREADS=1 which likely should not be needed anymore on recent OS kernels.


Related issues: 13 (2 open, 11 closed)

Related to openQA Project - coordination #139010: [epic] Long OSD ppc64le job queue (New, 2023-11-04)
Related to openQA Tests - action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M (Resolved, mkittler, 2023-09-20)
Related to openQA Tests - action #60257: [functional][u][mistyping] random typing issues on wayland (Rejected, mgriessmeier, 2019-11-25)
Related to openQA Tests - action #54914: [Leap-15.4][qe-core][functional] Mistyping in some modules: vlc, x_vt (formerly xorg_vt), etc. (New, 2019-07-31)
Related to openQA Tests - action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) (Resolved, SLindoMansilla, 2019-01-15)
Related to openQA Tests - action #42362: [qe-core][qem][virtio][sle15sp0][sle15sp1][desktop][typing] test fails in window_system because "typing string is too fast in wayland" (Resolved, zcjia, 2018-10-12)
Related to openQA Infrastructure - action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) (Resolved, okurz, 2018-10-05)
Related to openQA Tests - action #40670: [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters (Resolved, SLindoMansilla, 2018-09-06)
Related to openQA Tests - action #26910: [ppc64le][tools][BLOCKED][bsc#1041747] PowerPC looses keys (Resolved, okurz, 2017-10-20, 2018-03-27)
Related to openQA Tests - action #68218: [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot (Resolved, 2020-06-18)
Related to openQA Infrastructure - action #30595: [ppc64le] Deploy bcache on tmpfs workers (Closed, 2018-01-22)
Copied from openQA Infrastructure - action #136121: Repurpose PowerPC hardware in FC Basement - deagol (Resolved, okurz, 2023-09-20)
Copied to openQA Infrastructure - action #150983: CPU Load and usage alert for openQA workers size:S (Resolved, okurz)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #136121: Repurpose PowerPC hardware in FC Basement - deagol added
Actions #2

Updated by okurz 6 months ago

Actions #3

Updated by okurz 6 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready

Connected power cables and system ethernet and updated the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9588 . Logged into https://fsp1-mania.qe.nue2.suse.org/cgi-bin/cgi with admin/admin. IPMI access also works with "ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin" without a username. In the ASM menu "System Configuration > Firmware Configuration" I set the firmware type to OPAL, powered on over IPMI and connected SoL. Eventually petitboot came up. From there I could go to a shell and confirm that we have network on the device "enP5p1s0f0" with L2 address 40:f2:e9:31:54:60, so I added that to racktables. Following #119008-6 I can come up with a netboot of the installer:

wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/linux
wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/initrd
kexec -l linux --initrd initrd --command-line 'install=http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/ autoyast=https://is.gd/oqaay5 sshd=1 sshpassword=linux rootpassword=linux console=tty0 console=hvc0'
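
For reference, the IPMI power-on and serial-over-LAN steps mentioned above would look roughly like this (a sketch using standard ipmitool subcommands, same credentials as above):

# power the machine on via the FSP/BMC
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin chassis power on
# attach to the serial-over-LAN console to watch petitboot come up
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin sol activate

Note that "kexec -l" only loads the kernel and initrd; a subsequent "kexec -e" then actually boots into the installer.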

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 for an update of DHCP/DNS entries.

And in the installed system I installed the old kernel according to #116078-81 and added a package lock with

for i in kernel-default util-linux "kernel*"; do zypper al --comment "poo#119008, kernel regression boo#1202138" $i; done
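
In case those locks need to be reviewed or lifted again later, standard zypper lock handling should do:

zypper ll                                        # list the currently active package locks
zypper rl kernel-default util-linux "kernel*"    # remove the locks again once the regression is fixed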
Actions #4

Updated by openqa_review 6 months ago

  • Due date set to 2023-11-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by okurz 6 months ago

I had to tweak the network settings during installation as the installer always insisted on using the unconnected eth1, but eventually I ended up with a properly SSH-reachable installed system, although I ended up installing Leap 15.4. I can try to reinstall or upgrade to Leap 15.5; the old default root password is set.

Actions #6

Updated by okurz 6 months ago

  • Due date changed from 2023-11-25 to 2023-12-15
  • Status changed from In Progress to Blocked

I checked storage performance and sda, sdb, sdc, sdd, sde, sdf and sdg all seem to have the same performance. So I would say it's ok to stay with the root partition on sda, potentially change to RAID, and use something like sdf+sdg, both 1.1 TB, for /var/lib/openqa as cache and pool storage.
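
A quick way to compare raw read throughput across those disks would be something like the following sketch (not necessarily how the check above was done):

for d in /dev/sd{a..g}; do hdparm -t "$d"; done

The RAID0 array and mount were then set up with: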

mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=2 --run /dev/sd{f,g}
/sbin/mkfs.ext2 -F /dev/md/openqa
mkdir -p /var/lib/openqa
echo '/dev/md/openqa /var/lib/openqa ext2 defaults 0 0' >> /etc/fstab
mount -a
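
A quick sanity check of the resulting array and mount (a sketch; persisting the array in mdadm.conf is optional here since the filesystem content is recreatable):

cat /proc/mdstat                 # the raid0 array should show up as active
mdadm --detail /dev/md/openqa    # level and member disks
df -h /var/lib/openqa            # roughly 2.2 TB usable from the two 1.1 TB disks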

Now waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 because I would like to prevent problems from the lack of a proper static FQDN.

Actions #7

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #8

Updated by okurz 6 months ago

  • Status changed from Blocked to In Progress

The MR was merged and deployed, thanks to gpfuetzenreuther. "wicked ifup all" on mania made it use the new IP and the machine is reachable under the FQDN. Rebooting to make that fully effective and then adding the machine to salt.

Actions #9

Updated by okurz 6 months ago

  • Related to action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Actions #10

Updated by okurz 6 months ago

  • Subject changed from Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC to Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M
  • Description updated (diff)
Actions #11

Updated by okurz 6 months ago

Added to salt, applied the high state multiple times and triggered verification jobs:

openqa-clone-job --repeat 20 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?groupid=110&version=15-SP6&distri=sle&build=poo139271-okurz

One erroneous key-repetition error in https://openqa.suse.de/tests/12811244#step/installation_snapshots/4 makes it 1/20 failed, i.e. 5.0%, IMHO too much. Let's see if we can reproduce that more easily with more instances. On mania I did

systemctl enable --now openqa-worker-auto-restart@{11..30}

and am now triggering 200 jobs for a bigger sample with

openqa-clone-job --repeat 200 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2 _GROUP=0 WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2

That shows 12 failed vs. 158 passed, i.e. about a 7% failure rate, which is rather high, but no significant new problem due to the increase in the number of instances.

Asking in https://suse.slack.com/archives/C02CANHLANP/p1700035576832419 in case somebody has an idea:

Hi, I am in the process of adding another Power8 machine for qemu-ppc64le kvm testing to OSD. As part of that I ran some openQA test jobs, e.g. see https://openqa.suse.de/tests/overview?build=poo139271-okurz_2 . There are some test failures, 3 in postgresql_server due to timeout on service startup, 7 due to key repetition in installation_snapshots. Do those failures ring a bell for anyone? Does anyone remember some special PowerPC kernel flag that would prevent that? If we can't solve that problem then I wouldn't add the machine to production worker classes hence next kernel live-patch testing might be delayed again significantly

And for reference calling:

for i in diesel petrol; do openqa-clone-job --repeat 40 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2-reference_$i _GROUP=0 WORKER_CLASS=$i; done

on diesel+petrol.

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol

My hypothesis for https://openqa.suse.de/tests/overview?modules=installation_snapshots&modules_result=failed&build=poo139271-okurz_2 is that the rotating disk storage on mania causes qemu snapshot creation to leave the virtual machines partially unresponsive or sluggish, which causes the mistyping. https://openqa.suse.de/tests/12811789/logfile?filename=autoinst-log.txt shows

[2023-11-15T07:27:51.425927Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found cleared-console-root-20200612, similarity 1.00 @ 97/3
[2023-11-15T07:27:51.431856Z] [debug] [pid:50570] ||| finished hostname console (runtime: 73 s)
[2023-11-15T07:27:51.433277Z] [debug] [pid:50570] Creating a VM snapshot lastgood
…
[2023-11-15T07:28:52.597378Z] [debug] [pid:50623] Snapshot complete
[2023-11-15T07:29:06.015723Z] [debug] [pid:50623] EVENT {"event":"RESUME","timestamp":{"microseconds":15462,"seconds":1700033346}}
[2023-11-15T07:29:06.020447Z] [debug] [pid:50570] ||| starting installation_snapshots tests/console/installation_snapshots.pm
[2023-11-15T07:29:06.022067Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/hd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=21474836480 backing_file=/var/lib/openqa/pool/2/raid/hd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022158Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/cd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=541298688 backing_file=/var/lib/openqa/pool/2/raid/cd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022875Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console
[2023-11-15T07:29:06.023013Z] [debug] [pid:50570] <<< testapi::select_console(testapi_console="root-console")
[2023-11-15T07:29:06.426979Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console -> lib/susedistribution.pm:986 called testapi::assert_screen
[2023-11-15T07:29:06.427226Z] [debug] [pid:50570] <<< testapi::assert_screen(mustmatch="root-console", no_wait=1, timeout=30)
[2023-11-15T07:29:06.657636Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found root-console-top-20200304, similarity 1.00 @ 96/2
[2023-11-15T07:29:06.657988Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658190Z] [debug] [pid:50570] <<< testapi::assert_script_run(cmd="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", quiet=undef, timeout=90, fail_message="")
[2023-11-15T07:29:06.658346Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658532Z] [debug] [pid:50570] <<< testapi::type_string(string="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:09.742616Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:09.742951Z] [debug] [pid:50570] <<< testapi::type_string(string="; echo yv1NU-\$?- > /dev/hvc0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:10.806876Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:10.807177Z] [debug] [pid:50570] <<< testapi::wait_serial(timeout=90, expect_not_found=0, no_regex=0, buffer_size=undef, regexp=qr/yv1NU-\d+-/u, quiet=undef, record_output=undef)
[2023-11-15T07:29:12.910833Z] [debug] [pid:50570] >>> testapi::wait_serial: (?^u:yv1NU-\d+-): ok
[2023-11-15T07:29:13.063412Z] [info] [pid:50570] ::: basetest::runtest: # Test died: command 'snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'' failed at /usr/lib/os-autoinst/testapi.pm line 926.

So the snapshot is created, the VM is resumed, and then the command "snapper list" is mistyped shortly afterwards. As an experiment I am triggering tests with disabled snapshots to check this hypothesis:

openqa-clone-job --repeat 80 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_3_disable_snapshots _GROUP=0 WORKER_CLASS=mania QEMU_DISABLE_SNAPSHOTS=1

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots

Let's see whether the reference builds or this one show a change.

And for the IPMI credentials:

ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user 2 set name ADMIN
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set password 2 …
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel setaccess 1 2 link=on ipmi=on callin=on privilege=4
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user enable 2
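
To double-check the new user and its channel privileges, something like this should work:

ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user list 1           # user 2 should show up as ADMIN and enabled
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel getaccess 1 2 # privilege level ADMINISTRATOR on channel 1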

And I rebooted due to some failures like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1980642#L1136

Failed to set net.ipv4.conf.br1.forwarding to 1: sysctl net.ipv4.conf.br1.forwarding does not exist

Fixed after reboot.
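
The per-interface sysctl tree only appears once br1 exists, so a quick check before re-applying salt would be a sketch like:

ip link show br1                          # the bridge has to exist first
sysctl net.ipv4.conf.br1.forwarding       # the key is only present once br1 exists
sysctl -w net.ipv4.conf.br1.forwarding=1  # then setting it manually should succeed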

If problems regarding mistyping persist then maybe we can also tweak some of the parameters listed by

sudo salt -C 'G@osarch:ppc64le' cmd.run 'for i in /sys/module/kvm*/parameters/*; do echo "$i: $(cat $i)"; done'
petrol.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
diesel.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
mania.qe.nue2.suse.org:
    /sys/module/kvm/parameters/halt_poll_ns: 10000
    /sys/module/kvm/parameters/halt_poll_ns_grow: 2
    /sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
    /sys/module/kvm/parameters/halt_poll_ns_shrink: 0
    /sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
    /sys/module/kvm_hv/parameters/h_ipi_redirect: 1
    /sys/module/kvm_hv/parameters/indep_threads_mode: Y
    /sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
    /sys/module/kvm_hv/parameters/nested: Y
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0

e.g. "one_vm_per_core"

Actions #12

Updated by okurz 6 months ago

  • Related to action #60257: [functional][u][mistyping] random typing issues on wayland added
Actions #13

Updated by okurz 6 months ago

  • Related to action #54914: [Leap-15.4][qe-core][functional] Mistyping in some modules: vlc, x_vt (formerly xorg_vt), etc. added
Actions #14

Updated by okurz 6 months ago

  • Related to action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) added
Actions #15

Updated by okurz 6 months ago

  • Related to action #42362: [qe-core][qem][virtio][sle15sp0][sle15sp1][desktop][typing] test fails in window_system because "typing string is too fast in wayland" added
Actions #16

Updated by okurz 6 months ago

  • Related to action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) added
Actions #17

Updated by okurz 6 months ago

  • Related to action #40670: [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters added
Actions #18

Updated by okurz 6 months ago

  • Related to action #26910: [ppc64le][tools][BLOCKED][bsc#1041747] PowerPC looses keys added
Actions #19

Updated by okurz 6 months ago

  • Related to action #68218: [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot added
Actions #20

Updated by okurz 6 months ago

  • Description updated (diff)
Actions #21

Updated by okurz 6 months ago

  • Tags changed from infra, fc-basement, PowerPC to infra, fc-basement, PowerPC, next-frankencampus-visit

https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1699983668665&to=1700055939342&viewPanel=56720 supports the hypothesis that snapshotting in tests is causing problems: over the first hours, openQA tests with qemu snapshots enabled were running and caused the disk I/O time to max out at 1 s; later, without snapshots, the situation is much better. I can look into fitting SSDs in the FC Basement.

Actions #22

Updated by okurz 6 months ago

  • Related to action #30595: [ppc64le] Deploy bcache on tmpfs workers added
Actions #23

Updated by okurz 6 months ago

All builds https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel, https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol and https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots show a 100% pass ratio, so maybe I can fit more performant SSDs tomorrow. If that is not possible or does not help then we should look for other suitable storage and/or keep qemu snapshots disabled. Or a tmpfs? But then how do we ensure the directory structure is always created? Also see #30595 or #106841.
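
If we go the tmpfs route, one way to recreate the directory structure at boot could be a small sketch like this (the worker user name and the instance count are assumptions):

mount -t tmpfs -o size=512G tmpfs /var/lib/openqa/pool
for i in $(seq 1 30); do install -d -o _openqa-worker "/var/lib/openqa/pool/$i"; done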

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/678, merged, deploying.

Actions #24

Updated by okurz 5 months ago

Monitoring https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-24h&to=now . mania is very stable. There were periods of stalling I/O and I am still considering using a tmpfs, but not for all of /var/lib/openqa, only for /var/lib/openqa/pool, so that I don't have a problem with a missing directory structure.

I did

systemctl stop openqa-worker-auto-restart@{1..30}
rm -rf /var/lib/openqa/pool/*
mount -t tmpfs -o size=512G tmpfs /var/lib/openqa/pool
systemctl start openqa-worker-auto-restart@{1..50}

so 50 (!) instances to stress the system. Then I triggered tests with

openqa-clone-job --repeat 100 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_4_50_instances_pool_on_tmpfs _GROUP=0 WORKER_CLASS=mania

-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_4_50_instances_pool_on_tmpfs with 8/100 failures, and https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1700161086904&to=1700207814794 showed the system to be clearly overwhelmed. So I stopped instances 31..50 again.

In the failures I see symptoms of lost or stuck keys, so it is good to know what the symptoms of an overloaded system look like.

I wonder if we can share mania for manual testing as well as using it as an openQA worker. Maybe if we limit the resources that libvirtd can take we can ensure that openQA operation remains unharmed.

Actions #25

Updated by okurz 5 months ago

  • Copied to action #150983: CPU Load and usage alert for openQA workers size:S added
Actions #27

Updated by okurz 5 months ago

  • Description updated (diff)

I added to /etc/fstab

/dev/md/openqa /var/lib/openqa                          ext2   defaults              0  0
none /var/lib/openqa/pool tmpfs size=512G 0 0
openqa.suse.de:/var/lib/openqa/share        /var/lib/openqa/share   nfs ro,noauto,nofail,retry=30,x-systemd.mount-timeout=30m,x-systemd.device-timeout=10m,x-systemd.automount  0 0
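
After editing fstab a quick check that everything resolves as intended (a sketch):

mount -a
findmnt /var/lib/openqa/pool                # should show the 512G tmpfs
df -h /var/lib/openqa /var/lib/openqa/pool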

Right now a lot of jobs are running. Monitoring that, then rebooting and putting the machine back into salt when the job queue has reduced.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/679 to enable qemu snapshots again now that the tmpfs is in place.

I am currently trying to add corresponding alerts for CPU load.

Actions #28

Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback
Actions #29

Updated by okurz 5 months ago

  • Tags changed from infra, fc-basement, PowerPC, next-frankencampus-visit to infra, fc-basement, PowerPC
  • Description updated (diff)

Merged and deployed. Monitoring for the time being, in particular waiting for the next automatically triggered reboot.

Actions #30

Updated by okurz 5 months ago

  • Due date deleted (2023-12-15)
  • Status changed from Feedback to Resolved

mania is up after the latest automatic reboot and all jobs that I could see are very stable. All good now. Let's stay with 30 instances.
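
For reference, staying with 30 instances means keeping the extra worker units from the earlier stress test disabled, roughly:

systemctl disable --now openqa-worker-auto-restart@{31..50}
systemctl enable --now openqa-worker-auto-restart@{1..30}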
