Added the machine to salt, applied the high state multiple times and triggered verification jobs:
openqa-clone-job --repeat 20 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz WORKER_CLASS=mania
-> https://openqa.suse.de/tests/overview?groupid=110&version=15-SP6&distri=sle&build=poo139271-okurz
One key repetition error in https://openqa.suse.de/tests/12811244#step/installation_snapshots/4, which makes it 1/20 failed, 5.0%, IMHO too much. Let's see if we can reproduce that more easily with more instances. On mania I did
systemctl enable --now openqa-worker-auto-restart@{11..30}
and am now triggering a bigger set of 200 jobs with
openqa-clone-job --repeat 200 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2 _GROUP=0 WORKER_CLASS=mania
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2
That shows 12 failed vs. 158 passed, so 12/170 = 7.1% failures, rather high. But there is no significant problem due to the increase of instances.
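For comparing the builds it helps to compute the rate the same way every time. A minimal sketch, assuming the rate is counted against all finished jobs (failed plus passed) and that the other 19 jobs of the first run passed; `failure_rate` is a hypothetical helper, not an openQA command:

```shell
# failure_rate FAILED PASSED: print failed / (failed + passed) as a percentage
failure_rate() {
    awk -v f="$1" -v p="$2" 'BEGIN {
        if (f + p == 0) { print "n/a"; exit }   # no finished jobs yet
        printf "%.1f\n", 100 * f / (f + p)
    }'
}

failure_rate 1 19     # first run on mania, prints 5.0
failure_rate 12 158   # second run on mania, prints 7.1
```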
Asking in https://suse.slack.com/archives/C02CANHLANP/p1700035576832419 , maybe somebody has an idea
Hi, I am in the process of adding another Power8 machine for qemu-ppc64le kvm testing to OSD. As part of that I ran some openQA test jobs, e.g. see https://openqa.suse.de/tests/overview?build=poo139271-okurz_2 . There are some test failures, 3 in postgresql_server due to timeout on service startup, 7 due to key repetition in installation_snapshots. Do those failures ring a bell for anyone? Does anyone remember some special PowerPC kernel flag that would prevent that? If we can't solve that problem then I wouldn't add the machine to production worker classes hence next kernel live-patch testing might be delayed again significantly
And for reference, running the same jobs on diesel and petrol:
for i in diesel petrol; do openqa-clone-job --repeat 40 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_2-reference_$i _GROUP=0 WORKER_CLASS=$i; done
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_diesel
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_2-reference_petrol
My hypothesis for https://openqa.suse.de/tests/overview?modules=installation_snapshots&modules_result=failed&build=poo139271-okurz_2 is that the rotating disk storage on mania causes qemu snapshot creation to make the virtual machines partially unresponsive or sluggish, which in turn causes the mistyping. https://openqa.suse.de/tests/12811789/logfile?filename=autoinst-log.txt shows
[2023-11-15T07:27:51.425927Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found cleared-console-root-20200612, similarity 1.00 @ 97/3
[2023-11-15T07:27:51.431856Z] [debug] [pid:50570] ||| finished hostname console (runtime: 73 s)
[2023-11-15T07:27:51.433277Z] [debug] [pid:50570] Creating a VM snapshot lastgood
…
[2023-11-15T07:28:52.597378Z] [debug] [pid:50623] Snapshot complete
[2023-11-15T07:29:06.015723Z] [debug] [pid:50623] EVENT {"event":"RESUME","timestamp":{"microseconds":15462,"seconds":1700033346}}
[2023-11-15T07:29:06.020447Z] [debug] [pid:50570] ||| starting installation_snapshots tests/console/installation_snapshots.pm
[2023-11-15T07:29:06.022067Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/hd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=21474836480 backing_file=/var/lib/openqa/pool/2/raid/hd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022158Z] [debug] [pid:50623] QEMU: Formatting '/var/lib/openqa/pool/2/raid/cd0-overlay4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=541298688 backing_file=/var/lib/openqa/pool/2/raid/cd0-overlay3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
[2023-11-15T07:29:06.022875Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console
[2023-11-15T07:29:06.023013Z] [debug] [pid:50570] <<< testapi::select_console(testapi_console="root-console")
[2023-11-15T07:29:06.426979Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:21 called testapi::select_console -> lib/susedistribution.pm:986 called testapi::assert_screen
[2023-11-15T07:29:06.427226Z] [debug] [pid:50570] <<< testapi::assert_screen(mustmatch="root-console", no_wait=1, timeout=30)
[2023-11-15T07:29:06.657636Z] [debug] [pid:50570] >>> testapi::_handle_found_needle: found root-console-top-20200304, similarity 1.00 @ 96/2
[2023-11-15T07:29:06.657988Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658190Z] [debug] [pid:50570] <<< testapi::assert_script_run(cmd="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", quiet=undef, timeout=90, fail_message="")
[2023-11-15T07:29:06.658346Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:06.658532Z] [debug] [pid:50570] <<< testapi::type_string(string="snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:09.742616Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:09.742951Z] [debug] [pid:50570] <<< testapi::type_string(string="; echo yv1NU-\$?- > /dev/hvc0\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2023-11-15T07:29:10.806876Z] [debug] [pid:50570] tests/console/installation_snapshots.pm:41 called testapi::assert_script_run
[2023-11-15T07:29:10.807177Z] [debug] [pid:50570] <<< testapi::wait_serial(timeout=90, expect_not_found=0, no_regex=0, buffer_size=undef, regexp=qr/yv1NU-\d+-/u, quiet=undef, record_output=undef)
[2023-11-15T07:29:12.910833Z] [debug] [pid:50570] >>> testapi::wait_serial: (?^u:yv1NU-\d+-): ok
[2023-11-15T07:29:13.063412Z] [info] [pid:50570] ::: basetest::runtest: # Test died: command 'snapper list --type single | tee -a /dev/hvc0 | grep 'after installation.*important=yes'' failed at /usr/lib/os-autoinst/testapi.pm line 926.
So the snapshot is created, the VM is resumed and then the command "snapper list" is mistyped shortly after. As an experiment I am triggering tests with snapshots disabled to check this hypothesis:
openqa-clone-job --repeat 80 --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de/tests/12798632 TEST+=-poo139271-okurz BUILD=poo139271-okurz_3_disable_snapshots _GROUP=0 WORKER_CLASS=mania QEMU_DISABLE_SNAPSHOTS=1
-> https://openqa.suse.de/tests/overview?build=poo139271-okurz_3_disable_snapshots
Let's see if the reference builds or this one show a difference.
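To put numbers on the timeline in the autoinst-log.txt excerpt above, the gaps between log lines can be computed directly from the timestamps. A rough sketch, assuming both timestamps are from the same UTC day; `seconds_between` is a hypothetical helper and the interpretation of the gaps is only my reading of the log:

```shell
# seconds_between START END: difference between two os-autoinst log
# timestamps of the form 2023-11-15T07:27:51.433277Z (same day assumed)
seconds_between() {
    awk -v a="$1" -v b="$2" '
        function secs(ts,   t, p) {
            t = substr(ts, index(ts, "T") + 1)   # keep HH:MM:SS.ffffffZ
            sub(/Z$/, "", t)
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        BEGIN { printf "%.1f\n", secs(b) - secs(a) }'
}

# "Creating a VM snapshot lastgood" until "Snapshot complete"
seconds_between 2023-11-15T07:27:51.433277Z 2023-11-15T07:28:52.597378Z   # prints 61.2
# "Creating a VM snapshot lastgood" until the RESUME event
seconds_between 2023-11-15T07:27:51.433277Z 2023-11-15T07:29:06.015723Z   # prints 74.6
# RESUME event until the first type_string call
seconds_between 2023-11-15T07:29:06.015723Z 2023-11-15T07:29:06.658532Z   # prints 0.6
```

So in this job roughly 75 s pass between the start of the snapshot and the resume, and typing starts about 0.6 s after the resume, which would fit the hypothesis but does not prove it.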
And for the IPMI credentials:
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set name 2 ADMIN
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user set password 2 …
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … channel setaccess 1 2 link=on ipmi=on callin=on privilege=4
ipmitool -I lanplus -H fsp1-mania.qe.nue2.suse.org -P … user enable 2
And I rebooted due to some failures like https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1980642#L1136
Failed to set net.ipv4.conf.br1.forwarding to 1: sysctl net.ipv4.conf.br1.forwarding does not exist
Fixed after reboot.
If the mistyping problems persist then maybe we can also tweak some of the KVM module parameters, as listed by
sudo salt -C 'G@osarch:ppc64le' cmd.run 'for i in /sys/module/kvm*/parameters/*; do echo "$i: $(cat $i)"; done'
petrol.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
diesel.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
mania.qe.nue2.suse.org:
/sys/module/kvm/parameters/halt_poll_ns: 10000
/sys/module/kvm/parameters/halt_poll_ns_grow: 2
/sys/module/kvm/parameters/halt_poll_ns_grow_start: 10000
/sys/module/kvm/parameters/halt_poll_ns_shrink: 0
/sys/module/kvm_hv/parameters/dynamic_mt_modes: 6
/sys/module/kvm_hv/parameters/h_ipi_redirect: 1
/sys/module/kvm_hv/parameters/indep_threads_mode: Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass: 1
/sys/module/kvm_hv/parameters/nested: Y
/sys/module/kvm_hv/parameters/one_vm_per_core: N
/sys/module/kvm_hv/parameters/target_smt_mode: 0
e.g. "one_vm_per_core"
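Before tweaking anything it is worth checking whether mania differs from diesel and petrol at all. A small sketch that reads output in the format shown above and prints only the parameters whose values are not the same on all hosts; `kvm_param_diff` is a hypothetical helper and the sample input is made up:

```shell
# kvm_param_diff: read "salt ... cmd.run" output on stdin and print every
# /sys/module parameter whose value is not the same on all hosts.
# Hypothetical helper for illustration, not part of salt or openQA.
kvm_param_diff() {
    awk '
        # host header such as "mania.qe.nue2.suse.org:"
        NF == 1 && $1 ~ /:$/ && substr($1, 1, 1) != "/" {
            host = substr($1, 1, length($1) - 1)
            next
        }
        # parameter line such as "/sys/module/kvm/parameters/halt_poll_ns: 10000"
        NF == 2 && substr($1, 1, 12) == "/sys/module/" {
            param = substr($1, 1, length($1) - 1)
            if (!(param in first)) first[param] = $2
            else if (first[param] != $2) differs[param] = 1
            seen[param] = seen[param] " " host "=" $2
        }
        END { for (p in differs) print p ":" seen[p] }
    ' | sort
}

# Made-up sample input in which one host deviates
kvm_param_diff <<'EOF'
hostA.example.org:
    /sys/module/kvm_hv/parameters/one_vm_per_core: N
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
hostB.example.org:
    /sys/module/kvm_hv/parameters/one_vm_per_core: Y
    /sys/module/kvm_hv/parameters/target_smt_mode: 0
EOF
```

Piping the salt command above into `kvm_param_diff` should print nothing for the current state, as petrol, diesel and mania report identical values in the listing.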