action #139271
Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M
Status: closed
Description
Motivation
In FC Basement we currently have unused PowerPC hardware that could be put to good use, namely mania, a Power8 machine.
Acceptance criteria
- AC1: DONE mania is actively used
- AC2: DONE production openQA OSD jobs are able to execute successfully on mania
- AC3: DONE Our inventory management system is up-to-date regarding the use and configuration of mania
- AC4: DONE Workerconf has the IPMI command to access the machine
Suggestions
- DONE Offer to investigate the machines for use, in coordination with hsehic, the former coordinator for those machines
- DONE Ensure the machine is used for something useful, e.g. openQA OSD worker
- DONE Add machine to DHCP/DNS config in https://gitlab.suse.de/OPS-Service/salt/
- DONE Boot the machine either over network or from a local install medium
- DONE Install openSUSE Leap 15.4
- DONE Deploy an OSD openQA worker following the documentation from https://wiki.suse.net/index.php/OpenQA and https://gitlab.suse.de/openqa/salt-states-openqa
- DONE Ensure there are no related alerts
- DONE Optionally: Help with #136130 as mania could be the third machine to bisect where problems come from
Out of scope
ppc64le multi-machine jobs -> #136130
Further details
Potentially interesting and related: https://bugzilla.suse.com/show_bug.cgi?id=1041747, never fixed; workarounds such as QEMUCPUS=1; QEMUTHREADS=1 are mentioned, which likely should not be needed anymore on recent kernels.
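For illustration, such a workaround could be applied to individual verification jobs the same way variables are passed elsewhere in this ticket (hypothetical invocation; the job URL is only an example taken from later comments):
openqa-clone-job --within-instance https://openqa.suse.de/tests/12798632 QEMUCPUS=1 QEMUTHREADS=1 BUILD=poo139271-qemucpus-check _GROUP=0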
Updated by okurz 10 months ago
- Copied from action #136121: Repurpose PowerPC hardware in FC Basement - deagol added
Updated by okurz 10 months ago
- Related to coordination #139010: [epic] Long OSD ppc64le job queue added
Updated by okurz 10 months ago
- Status changed from New to In Progress
- Assignee set to okurz
- Target version changed from future to Ready
Connected power cables and system ethernet and updated the racktables entry https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9588 . Logged into https://fsp1-mania.qe.nue2.suse.org/cgi-bin/cgi with admin/admin. IPMI access also works without a username:
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P admin
In the ASM menu "System Configuration > Firmware Configuration" I set the firmware type to OPAL, powered on over IPMI and connected SoL. Eventually petitboot came up. From there I could go to a shell and confirm that we have network on the device "enP5p1s0f0" with L2 address 40:f2:e9:31:54:60, so I added that to racktables as well. Following #119008-6 I netbooted the installer:
wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/linux
wget http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/boot/ppc64le/initrd
kexec -l linux --initrd initrd --command-line 'install=http://download.opensuse.org/distribution/openSUSE-stable/repo/oss/ autoyast=https://is.gd/oqaay5 sshd=1 sshpassword=linux rootpassword=linux console=tty0 console=hvc0'
kexec -e   # kexec -l only stages the kernel; this executes it
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 for an update of DHCP/DNS entries.
In the installed system I installed an older kernel according to #116078-81 and added package locks with
for i in kernel-default util-linux "kernel*"; do zypper al --comment "poo#119008, kernel regression boo#1202138" "$i"; done
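The resulting locks can be reviewed afterwards with the standard zypper subcommand:
zypper ll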
Updated by openqa_review 10 months ago
- Due date set to 2023-11-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago
I had to tweak the network settings during installation as the installer always insisted on using the unconnected eth1, but eventually I ended up with a properly SSH-reachable installed system, although I ended up with Leap 15.4. I can try to reinstall or upgrade to Leap 15.5; the default old root password is in use.
Updated by okurz 10 months ago
- Due date changed from 2023-11-25 to 2023-12-15
- Status changed from In Progress to Blocked
I checked storage performance and sda, sdb, sdc, sdd, sde, sdf and sdg all seem to perform the same. So I would say it's OK to stay with the root partition on sda, potentially changing to RAID later, and to use sdf+sdg, both 1.1 TB, as /var/lib/openqa for cache and pool.
mdadm --create /dev/md/openqa --level=0 --force --assume-clean --raid-devices=2 --run /dev/sd{f,g}
/sbin/mkfs.ext2 -F /dev/md/openqa
mkdir -p /var/lib/openqa
echo '/dev/md/openqa /var/lib/openqa ext2 defaults 0 0' >> /etc/fstab
mount -a
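For reference, a minimal way to do the per-disk performance comparison mentioned above and to verify the new array (the ticket does not state which tool was used for the check; hdparm here is an assumption):
for d in /dev/sd{a..g}; do hdparm -t "$d"; done   # rough sequential read comparison
cat /proc/mdstat                                  # confirm the RAID0 is assembled
findmnt /var/lib/openqa                           # confirm the fstab entry is mounted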
Now waiting for https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/4398 because I would like to prevent problems due to the lack of a proper static FQDN.
Updated by okurz 10 months ago
- Related to action #136130: test fails in iscsi_client due to salt 'host'/'nodename' confusion size:M added
Updated by okurz 10 months ago
Added the machine to salt, applied the high state multiple times and triggered verification jobs:
openqa-clone-job --repeat 20 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz WORKER_CLASS=mania
-> https://openqa.suse.de/tests/overview?groupid=110&version=15-SP6&distri=sle&build=poo139271-okurz
One key-repetition error in https://openqa.suse.de/tests/12811244#step/installation_snapshots/4 makes it 1/20 fails, 5.0%, IMHO too much. Let's see if we can reproduce that more easily with more instances. On mania I did
systemctl enable --now openqa-worker-auto-restart@{11..30}
and triggered 200 jobs for a bigger sample with
openqa-clone-job --repeat 200 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2 _GROUP=0 WORKER_CLASS=mania
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2
That shows 12 fails vs. 158 passed, i.e. 12/170 ≈ 7.1% failures, rather high, but no significant change due to the increase in the number of instances.
Asking in https://suse.slack.com/archives/C02CANHLANP/p1700035576832419 , maybe somebody has an idea:
Hi, I am in the process of adding another Power8 machine for qemu-ppc64le kvm testing to OSD. As part of that I ran some openQA test jobs, e.g. see https://openqa.suse.de/tests/overview?build=poo139271-okurz_2 . There are some test failures: 3 in postgresql_server due to timeout on service startup, 7 due to key repetition in installation_snapshots. Do those failures ring a bell for anyone? Does anyone remember some special PowerPC kernel flag that would prevent that? If we can't solve that problem then I wouldn't add the machine to production worker classes, hence the next kernel live-patch testing might be delayed again significantly.
And for reference calling:
for i in diesel petrol; do openqa-clone-job --repeat 40 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2-reference_$i _GROUP=0 WORKER_CLASS=$i; done
on diesel+petrol.
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol
My hypothesis for https://openqa.suse.de/tests/overview?modules=installation_snapshots&modules_result=failed&build=poo139271-okurz_2 is that the rotating disk storage on mania causes qemu snapshot creation to leave the virtual machines partially unresponsive or sluggish, which in turn causes mistyping. https://openqa.suse.de/tests/12811789/logfile?filename=autoinst-log.txt shows
[2023-11-15T07:27:51.425927Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found cleared-console-root-20200612, similarity 1.00 @ 97/3
[2023-11-15T07:27:51.431856Z] [debug] [pid:50570] ||| finished hostname console (runtime: 73 s)
[2023-11-15T07:27:51.433277Z] [debug] [pid:50570] Creating a VM snapshot lastgood
…
[2023-11-15T07:28:52.597378Z] [debug] [pid:50623] Snapshot complete
[2023-11-15T07:29:06.015723Z] [debug] [pid:50623] EVENT {"event":"RESUME","timestamp":{"microseconds":15462,"seconds":1700033346}}
[2023-11-15T07:29:06.020447Z] [debug] [pid:50570] ||| starting installation_snapshots tests/console/installation_snapshots.pm
[2023-11-15T07:29:06.022067Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/hd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=21474836480 backing_file=/var/lib/openqa/pool/2/raid/hd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022158Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/cd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=541298688 backing_file=/var/lib/openqa/pool/2/raid/cd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022875Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console
[2023-11-15T07:29:06.023013Z] [debug] [pid:50570] <<< testapi::select_console(testapi_console="root-console")
[2023-11-15T07:29:06.426979Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console -> lib/susedistribution.pm:986 called testapi::assert_screen
[2023-11-15T07:29:06.427226Z] [debug] [pid:50570] <<< testapi::assert_screen(mustmatch="root-console", no_wait=1, timeout=30)
[2023-11-15T07:29:06.657636Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found root-console-top-20200304, similarity 1.00 @ 96/2
[2023-11-15T07:29:06.657988Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658190Z] [debug] [pid:50570] <<< testapi::assert_script_run(cmd="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", quiet=undef, timeout=90, fail_message="")
[2023-11-15T07:29:06.658346Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658532Z] [debug] [pid:50570] <<< testapi::type_string(string="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:09.742616Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:09.742951Z] [debug] [pid:50570] <<< testapi::type_string(string="; echo yv1NU-\$?- > /dev/hvc0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:10.806876Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:10.807177Z] [debug] [pid:50570] <<< testapi::wait_serial(timeout=90, expect_not_found=0, no_regex=0, buffer_size=undef, regexp=qr/yv1NU-\d+-/u, quiet=undef, record_output=undef)
[2023-11-15T07:29:12.910833Z] [debug] [pid:50570] >>> testapi::wait_serial: (?^u:yv1NU-\d+-): ok
[2023-11-15T07:29:13.063412Z] [info] [pid:50570] ::: basetest::runtest: # Test died: command 'snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'' failed at /usr/lib/os-autoinst/testapi.pm line 926.
So the snapshot is created, the VM is resumed, and the command "snapper list" is mistyped shortly afterwards. As an experiment I am triggering tests with disabled snapshots to check this hypothesis:
openqa-clone-job --repeat 80 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_3_disable_snapshots _GROUP=0 WORKER_CLASS=mania QEMU_DISABLE_SNAPSHOTS=1
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots
Let's see whether the reference builds or this one shows a change.
And for setting IPMI credentials:
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set name 2 ADMIN
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set password 2 …
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel setaccess 1 2 link=on ipmi=on callin=on privilege=4
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user enable 2
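The resulting user configuration can be verified with the standard subcommand (assuming channel 1 is the LAN channel, as in the setaccess call above):
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -U ADMIN -P … user list 1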
And I rebooted due to some failures like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1980642#L1136
Failed to set net.ipv4.conf.br1.forwarding to 1: sysctl net.ipv4.conf.br1.forwarding does not exist
Fixed after reboot.
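In case this happens again, the state can be checked quickly before re-applying the high state (plain iproute2/procps commands):
ip link show br1                      # does the bridge exist yet?
sysctl net.ipv4.conf.br1.forwarding   # the per-interface key only appears once br1 exists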
If the mistyping problems persist then maybe we can also tweak some of the parameters shown by
sudo salt -C 'G@osarch:ppc64le' cmd.run 'for i in /sys/module/kvm*/parameters/*; do echo "$i: $(cat $i)"; done'
petrol.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
diesel.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
mania.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
e.g. "one_vm_per_core"
Updated by okurz 10 months ago
- Related to action #60257: [functional][u][mistyping] random typing issues on wayland added
Updated by okurz 10 months ago
- Related to action #54914: [Leap-15.4] Mistyping in some modules: vlc, x_vt (formerly xorg_vt), etc. added
Updated by okurz 10 months ago
- Related to action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) added
Updated by okurz 10 months ago
- Related to action #42362: [qe-core][qem][virtio][sle15sp0][sle15sp1][desktop][typing] test fails in window_system because "typing string is too fast in wayland" added
Updated by okurz 10 months ago
- Related to action #42032: [sle][functional][u][tools] QA-power-8-*-kvm have missing keys (ninjakeys) added
Updated by okurz 10 months ago
- Related to action #40670: [sle][functional][u][medium][tools][sporadic] test fails in welcome - typos in kernel boot parameters added
Updated by okurz 10 months ago
- Related to action #26910: [ppc64le][tools][BLOCKED][bsc#1041747] PowerPC looses keys added
Updated by okurz 10 months ago
- Related to action #68218: [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot added
Updated by okurz 10 months ago
- Tags changed from infra, fc-basement, PowerPC to infra, fc-basement, PowerPC, next-frankencampus-visit
https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1699983668665&to=1700055939342&viewPanel=56720 supports the hypothesis that snapshotting in tests is causing problems: over the first hours, openQA tests ran with qemu snapshots enabled, causing disk I/O time to max out at 1 s; later, without snapshots, the situation is much better. I can look into fitting SSDs in FC Basement.
Updated by okurz 10 months ago
- Related to action #30595: [ppc64le] Deploy bcache on tmpfs workers added
Updated by okurz 10 months ago
All builds https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel, https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol and https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots show a 100% pass ratio, so maybe I can fit more performant SSDs tomorrow. If that is not possible or does not help then we should look for other suitable storage and/or keep qemu snapshots disabled. Or a tmpfs? But then how do we ensure the directory structure is always created? See the sketch below; also see #30595 or #106841.
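One option for the directory-structure question could be systemd-tmpfiles, which recreates directories on every boot (a sketch under the assumption that each worker instance creates its own numbered pool subdirectory; the user name follows the openSUSE openQA packaging):
# /etc/tmpfiles.d/openqa-pool.conf -- hypothetical, not deployed in this ticket
d /var/lib/openqa/pool 0755 _openqa-worker root -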
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/678, merged, deploying.
Updated by okurz 10 months ago
Monitoring https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-24h&to=now . mania is very stable. There were periods of stalling I/O, and I am still considering using a tmpfs, but not for all of /var/lib/openqa, only for /var/lib/openqa/pool, so that I don't have a problem with a missing directory structure.
I did
systemctl stop openqa-worker-auto-restart@{1..30}
rm -rf /var/lib/openqa/pool/*
mount -t tmpfs -o size=512G tmpfs /var/lib/openqa/pool
systemctl start openqa-worker-auto-restart@{1..50}
so 50 (!) instances to stress the system. Then triggered tests with
openqa-clone-job --repeat 100 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_4_50_instances_pool_on_tmpfs _GROUP=0 WORKER_CLASS=mania
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_4_50_instances_pool_on_tmpfs with 8/100 fails, and https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=1700161086904&to=1700207814794 showed the system to be clearly overwhelmed. So I stopped instances 31..50 again.
In the failures I see symptoms of lost or stuck keys, so it is good to know what the symptoms of an overloaded system look like.
I wonder if we can share mania for manual testing as well as for the openQA worker. Maybe if we limit the resources that libvirtd can take, we can ensure openQA operation stays unharmed.
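One possible mechanism for that (an idea only, nothing was configured as part of this ticket): libvirt-started guests run in the systemd machine.slice, so resource limits there would cap manually started VMs without touching the openQA worker services; the values below are arbitrary examples:
systemctl set-property machine.slice CPUQuota=800% MemoryMax=64G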
Updated by okurz 10 months ago
- Copied to action #150983: CPU Load and usage alert for openQA workers size:S added
Updated by okurz 10 months ago
- Description updated (diff)
I added to /etc/fstab:
/dev/md/openqa /var/lib/openqa ext2 defaults 0 0
none /var/lib/openqa/pool tmpfs size=512G 0 0
openqa.suse.de:/var/lib/openqa/share /var/lib/openqa/share nfs ro,noauto,nofail,retry=30,x-systemd.mount-timeout=30m,x-systemd.device-timeout=10m,x-systemd.automount 0 0
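After editing fstab the entries can be sanity-checked without a reboot:
mount -a                        # apply all non-noauto fstab entries
findmnt /var/lib/openqa         # md array mounted?
findmnt /var/lib/openqa/pool    # tmpfs mounted on top?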
Right now a lot of jobs are running. I am monitoring that and will reboot and put the machine back into salt when the job queue has shrunk.
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/679 to enable qemu snapshots again after tmpfs in place.
I am currently trying to add corresponding alerts for CPU load.
Updated by jlausuch 10 months ago
- Related to action #138770: [BCI] Reduce coverage for ppc64le added