action #158266

closed

openQA jobs on diesel ppc64le fail due to auto_review:"QEMU: This is probably because your SMT is enabled."

Added by okurz 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Low
Assignee:
Category: Regressions/Crashes
Start date: 2024-03-29
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

From https://suse.slack.com/archives/C02CANHLANP/p1711700522125619

Warning: tests are failing on ppc64 worker host diesel around 5 hours ago, seem qemu VM can't start. https://openqa.suse.de/admin/workers/3393 https://openqa.suse.de/admin/workers/3388 https://openqa.suse.de/admin/workers/3390

autoinst-log.txt says

[2024-03-29T09:37:43.496499+01:00] [debug] [pid:18748] QEMU: error: kvm run failed Device or resource busy
[2024-03-29T09:37:43.496606+01:00] [debug] [pid:18748] QEMU: This is probably because your SMT is enabled.
[2024-03-29T09:37:43.496679+01:00] [debug] [pid:18748] QEMU: VCPU can only run on primary threads with all secondary threads offline.
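
To check other job logs for the same signature, a minimal sketch could look like the following. The function name log_has_smt_failure is made up for illustration; the search string is copied from the log lines above.

```shell
# Hypothetical helper: succeed if the given log file contains the SMT failure
# signature seen in autoinst-log.txt above.
log_has_smt_failure() {
    grep -q 'QEMU: This is probably because your SMT is enabled' "$1"
}

# Example use (the file name is an assumption):
#   log_has_smt_failure autoinst-log.txt && echo "SMT problem found"
```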

There is the "smt_off" service https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/worker.sls?ref_type=heads#L263 to fix the problem regarding SMT. The service was running fine, but I restarted it anyway and then restarted https://openqa.suse.de/tests/13906928#live. The problem still reproduces.

Only diesel is affected, mania and petrol seem fine.

Suggestions

  • DONE ssh osd 'sudo salt-key -y -d diesel.qe.nue2.suse.org'
  • DONE ssh diesel.qe.nue2.suse.org 'sed -i 's/qemu_ppc64le,/qemu_ppc64le-poo158266,/' /etc/openqa/workers.ini && systemctl restart openqa-worker-auto-restart@{1..8} && systemctl disable --now salt-minion telegraf'
  • DONE host=openqa.suse.de WORKER=diesel result="result='failed'" comment="label:poo158266" ./openqa-advanced-retrigger-jobs
  • Investigate what is different on diesel vs. mania+petrol. Maybe mania+petrol are also affected but not noticed yet, maybe they haven't rebooted yet
  • Fix the problem, optionally wait for reported bug
  • verify
  • rollback

Rollback actions

  • DONE ssh diesel.qe.nue2.suse.org 'sed -i 's/qemu_ppc64le-poo158266,/qemu_ppc64le,/' /etc/openqa/workers.ini && systemctl restart openqa-worker-auto-restart@{1..8} && systemctl enable --now salt-minion telegraf'
  • DONE ssh osd 'sudo salt-key -y -a diesel.qe.nue2.suse.org'
  • ssh root@kerosene.qe.nue2.suse.org 'zypper rl powerpc-utils && zypper -n in powerpc-utils'
  • Revert https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/763
Actions #1

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 9 months ago · Edited

okurz@openqa:~> sudo salt -C 'G@osarch:ppc64le' cmd.run 'ppc64_cpu --smt'
petrol.qe.nue2.suse.org:
    SMT is off
grenache-1.oqa.prg2.suse.org:
    SMT=8
mania.qe.nue2.suse.org:
    SMT is off

and on diesel I can also see SMT=8, so clearly the smt_off service did not do its job. I did an explicit systemctl stop smt_off && systemctl start smt_off but that had no effect either. Running ppc64_cpu --smt=off directly is not effective either, and neither is a reboot.
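
For scripted checks, the two output forms seen above ("SMT is off" and "SMT=8") can be classified with a small helper. This is only a sketch: the function name smt_state is made up here, and it only knows the two strings that appear in this ticket.

```shell
# Hypothetical helper: classify the text printed by `ppc64_cpu --smt`.
# Reads the command output from stdin and prints "off", "on" or "unknown".
# Only the two output forms seen in this ticket are recognized.
smt_state() {
    while IFS= read -r line; do
        case "$line" in
            *"SMT is off"*) echo off; return 0 ;;
            *SMT=*) echo on; return 0 ;;
        esac
    done
    echo unknown
    return 1
}

# Example use on a ppc64le host:
#   ppc64_cpu --smt | smt_state
```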

Trying alternative ways to disable SMT already at boot time. According to https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html there is a parameter "nosmt" which is also available for "PPC", but we run kernel 5.3, where that flag does not yet seem to be supported according to https://www.kernel.org/doc/html/v5.3/admin-guide/kernel-parameters.html. The patch adding it for PPC is described in https://lwn.net/Articles/936829/

Also

$ cat /sys/devices/system/cpu/smt/control
notimplemented

Trying a complete poweroff and on again. No effect.

Maybe an upgrade happened and diesel rebooted, on request or by accident, but petrol+mania haven't rebooted yet? On petrol I see in the journalctl output for auto-update that, as expected, kernel-default and multiple related patches are locked, but at least one other patch for "sed" was installed. That will certainly not trigger a request for reboot, but petrol and mania have been online for 58 days, so other updates could have accumulated that might be at fault here. systemctl status says that petrol has been up since Tue 2024-01-30 15:54:02 CET; 1 month 28 days ago, so we should check in /var/log/zypp/history what was installed since then. Or try an earlier snapshot.
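
To list what was installed since a given boot time, the history file can be filtered on its pipe-separated fields (timestamp, action, package, version). A minimal sketch, assuming the line format shown in this ticket; the function name installs_since is made up.

```shell
# Hypothetical helper: print "timestamp package version" for every install
# recorded at or after the given timestamp ("YYYY-MM-DD HH:MM:SS").
# Timestamps in this format sort correctly as plain strings.
installs_since() {
    since=$1
    history=${2:-/var/log/zypp/history}
    awk -F'|' -v since="$since" \
        '$2 == "install" && ($1 "") >= (since "") { print $1, $3, $4 }' "$history"
}

# Example use with the boot time of petrol mentioned above:
#   installs_since '2024-01-30 15:54:02'
```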

On diesel snapper list says

# snapper list
    # | Type   | Pre # | Date                     | User | Used Space | Cleanup | Description            | Userdata     
------+--------+-------+--------------------------+------+------------+---------+------------------------+--------------
   0  | single |       |                          | root |            |         | current                |              
2184* | single |       | Thu Jul 28 12:57:20 2022 | root |  16.50 MiB |         | writable copy of #2181 |              
3166  | pre    |       | Tue Dec 12 06:34:11 2023 | root | 339.06 MiB | number  | zypp(zypper)           | important=yes
3167  | post   |  3166 | Tue Dec 12 06:35:00 2023 | root | 247.69 MiB | number  |                        | important=yes
3292  | pre    |       | Thu Jan 25 02:03:10 2024 | root | 313.69 MiB | number  | zypp(zypper)           | important=yes
3293  | post   |  3292 | Thu Jan 25 02:08:18 2024 | root | 241.19 MiB | number  |                        | important=yes
3440  | pre    |       | Thu Mar 14 02:03:08 2024 | root | 292.81 MiB | number  | zypp(zypper)           | important=yes
3441  | post   |  3440 | Thu Mar 14 02:05:31 2024 | root | 176.69 MiB | number  |                        | important=yes
3448  | pre    |       | Sat Mar 16 02:02:51 2024 | root | 174.44 MiB | number  | zypp(zypper)           | important=yes
3449  | post   |  3448 | Sat Mar 16 02:03:12 2024 | root | 176.62 MiB | number  |                        | important=yes
3472  | pre    |       | Tue Mar 26 02:02:57 2024 | root | 160.81 MiB | number  | zypp(zypper)           | important=yes
3473  | post   |  3472 | Tue Mar 26 02:03:08 2024 | root | 158.19 MiB | number  |                        | important=yes
3482  | pre    |       | Thu Mar 28 02:03:05 2024 | root | 182.06 MiB | number  | zypp(zypper)           | important=no 
3483  | post   |  3482 | Thu Mar 28 02:03:39 2024 | root |   5.50 MiB | number  |                        | important=no 
3484  | pre    |       | Thu Mar 28 06:14:51 2024 | root | 960.00 KiB | number  | zypp(zypper)           | important=no 
3485  | post   |  3484 | Thu Mar 28 06:15:14 2024 | root |   1.75 MiB | number  |                        | important=no 
3486  | pre    |       | Fri Mar 29 02:03:14 2024 | root |   7.06 MiB | number  | zypp(zypper)           | important=no 
3487  | post   |  3486 | Fri Mar 29 02:03:32 2024 | root | 576.00 KiB | number  |                        | important=no 
3488  | pre    |       | Fri Mar 29 02:04:08 2024 | root | 512.00 KiB | number  | zypp(zypper)           | important=no 
3489  | post   |  3488 | Fri Mar 29 02:04:15 2024 | root |   1.12 MiB | number  |                        | important=no 
3490  | pre    |       | Fri Mar 29 06:14:53 2024 | root |   1.69 MiB | number  | zypp(zypper)           | important=no 
3491  | post   |  3490 | Fri Mar 29 06:15:16 2024 | root |   1.56 MiB | number  |                        | important=no 

The snapshots on date 2024-01-25 correspond to the following section from /var/log/zypp/history:

2024-01-25 02:04:25|command|root@diesel|'zypper' '-n' '--no-refresh' '--non-interactive-include-reboot-patches' 'patch' '--replacefiles' '--auto-agree-with-licenses' '--download-in-advance'|
2024-01-25 02:04:30|install|ghostscript|9.52-150000.180.1|ppc64le||repo-sle-update|f54ef118624a5ab3f6f7dd71a916042771f3cc23d3de2147b2ba4dc17f768ba0|
2024-01-25 02:04:30|install|ghostscript-x11|9.52-150000.180.1|ppc64le||repo-sle-update|698afe39e0b153c3121d412fb893bcadd4e7c70447c1fbfea936313450d4c26f|
2024-01-25 02:04:30|install|libsystemd0|249.17-150400.8.40.1|ppc64le||repo-sle-update|38c0b17187c869d87b05ada1fb52650f3b6da409cc6af08f65301b8ac1401f57|
2024-01-25 02:04:31|install|libudev1|249.17-150400.8.40.1|ppc64le||repo-sle-update|738ffa24b6b2dbe745dc60eeefaa838b1bae2ccfec2ad0b9cce5d2e24546fe19|
2024-01-25 02:04:48|install|systemd|249.17-150400.8.40.1|ppc64le||repo-sle-update|73c2c0fe61ff700f2a1ed43bbea574bb4a5177a15c833e2a2653a4eb6a644378|
2024-01-25 02:04:57|install|systemd-container|249.17-150400.8.40.1|ppc64le||repo-sle-update|05866df31f992d77cf1d421ea9cbeb077b096d3abef7952e0d590ed0b1571bf7|
2024-01-25 02:04:57|install|yast2-pkg-bindings|4.5.3-150500.3.3.1|ppc64le||repo-sle-update|01afd2e363728a7bb41773c4659a1d9b95af418e6ee9d14cd7c09a23e2ed5e05|
2024-01-25 02:05:09|install|systemd-network|249.17-150400.8.40.1|ppc64le||repo-sle-update|0af13c615c3348241cad1a3b52a6cf16a8c88f73acf6c65eb84904da49f348a7|
2024-01-25 02:05:18|install|udev|249.17-150400.8.40.1|ppc64le||repo-sle-update|f50c81b9ade932af51457528f9d017d33b46d2866d99c3fcdd97d9dd87769f4e|
2024-01-25 02:05:18|install|systemd-sysvinit|249.17-150400.8.40.1|ppc64le||repo-sle-update|7fbd22b5b6e6abbfe9a24813a71ee7db687cceb040edadb67452f4deb06c42a0|
2024-01-25 02:05:18|install|systemd-lang|249.17-150400.8.40.1|noarch||repo-sle-update|97c2f4cfcadf70bd920ef694516249ad2b8d1e2f33b98814b85f482aaeba4947|
2024-01-25 02:05:19|install|systemd-coredump|249.17-150400.8.40.1|ppc64le||repo-sle-update|7db32c9702d820f2abbc2e0c4f07a37d0e0847dd0ce25ada65067e2141387bcf|

So I could try snapshot 3293, which was taken before the last boot of petrol+mania, and check if SMT is still disabled in that one. After booting into it I could confirm that SMT is actually off. zypper dup shows a bigger list of pending updates. Let's first crosscheck without installing anything more: mv /etc/systemd/system/auto-update.timer{,.disabled-poo158266} and reboot. After the reboot SMT is still off. Now trying to install only some packages from the pending zypper dup and checking the effect:

sudo zypper in perl-Mojolicious-Plugin-AssetPack perl-Mojo-IOLoop-ReadWriteProcess python3-nftables yast2-theme yast2-packager yast2-network yast2-logs yast2-http-server yast2 xorg-x11-server-Xvfb xorg-x11-server wicked-service wicked wget-lang wget webkit2gtk-4_0-injected-bundles timezone stress-ng-bash-completion stress-ng sed-lang sed WebKitGTK-4.0-lang cmake cmake-full os-autoinst openQA-worker

After that I should try

sudo zypper -n in aaa_base aaa_base-extras audit bind-utils cmake-man containerd
  coreutils coreutils-doc coreutils-lang cpio cpio-lang cpio-mt cpp7 dhcp docker docker-bash-completion docker-rootless-extras docker-zsh-completion gcc7 gcc7-c++ gdb ghostscript ghostscript-x11 git-core
  git-gui gitk glibc glibc-devel glibc-extra glibc-i18ndata glibc-lang glibc-locale glibc-locale-base
  gnutls grub2 grub2-powerpc-ieee1275 grub2-powerpc-ieee1275-extras grub2-snapper-plugin
  grub2-systemd-sleep-plugin hwdata inst-source-utils kdump kpartx krb5 libOpenCL1 libaom3 libasan4
  libaudit1 libauparse0 libavahi-client3 libavahi-common3 libavahi-glib1 libavcodec57 libavformat57
  libavif13 libavresample3 libavutil55 libbluetooth3 libdns_sd libduktape206 libfreebl3 libgfortran4
  libgif7 libgnutls30 libjasper4 libjavascriptcoregtk-4_0-18 libmaxminddb0 libmetalink3 libmpath0
  libnetpbm11 libnftables1 libopenblas_openmp0 libopenblas_pthreads0 libopenssl-1_1-devel libopenssl1_0_0
  libopenssl1_1 libopenssl3 libopenvswitch-2_14-0 libpq5 libpython2_7-1_0 libpython3_6m1_0 libqrencode4
  libsoftokn3 libsource-highlight4 libssh-config libssh2-1 libssh4 libstdc++6-devel-gcc7 libswresample2
  libswscale4 libtesseract5 libtiff5 libubsan0 libuv1 libwebkit2gtk-4_0-37 libxkbcommon-x11-0
  libxkbcommon0 libxml2-2 libxml2-tools login_defs mozilla-nss mozilla-nss-certs multipath-tools netcfg
  netpbm nftables nscd openssh openssh-askpass-gnome openssh-clients openssh-common openssh-helpers
  openssh-server openssl-1_1 openvswitch pam-config perl-Bootloader perl-Git powerpc-utils ppc64-diag
  python python-base python-curses python-xml python3 python3-M2Crypto python3-attrs python3-base
  python3-bind python3-curses python3-dbm python3-nftables python3-pycryptodome python3-pyserial
  python3-rpm python3-tk rpm runc sed sed-lang shadow stress-ng stress-ng-bash-completion sudo
  sudo-plugin-python supportutils suse-module-tools system-group-audit systemd-presets-common-SUSE
  systemd-rpm-macros tesseract-ocr timezone webkit2gtk-4_0-injected-bundles wget wget-lang wicked
  wicked-service xorg-x11-server xorg-x11-server-Xvfb yast2 yast2-http-server yast2-logs yast2-network
  yast2-packager yast2-theme

The list contains "powerpc-utils", the package that provides the application "ppc64_cpu". I did zypper al -m "poo#158266: ppc64_cpu --smt=off becomes ineffective" powerpc-utils and will first try to upgrade all other packages and see if that is also problematic.

Installed, rebooted, ppc64_cpu --smt shows off so all good. So the upgrade that makes the difference is powerpc-utils 1.3.11-150500.3.6.1 -> 1.3.11-150500.3.14.3; let's see if we can bisect.

One version in between, powerpc-utils-1.3.11-150500.3.9.1.ppc64le, not affected.

Crosschecking again 1.3.11-150500.3.9.1 -> 1.3.11-150500.3.14.3.

ppc64_cpu --smt && ppc64_cpu --smt=on && ppc64_cpu --smt && ppc64_cpu --smt=off && ppc64_cpu --smt

shows, as expected for the affected version,

SMT=8
SMT=8
SMT=8

Let's check whether a reboot is needed after the downgrade:

zypper in --oldpackage powerpc-utils-1.3.11-150500.3.9.1
ppc64_cpu --smt && ppc64_cpu --smt=on && ppc64_cpu --smt && ppc64_cpu --smt=off && ppc64_cpu --smt
SMT is off
One or more cpus could not be on/offlined

So it's now off but I can't switch it on again. Upgrading back to the affected version:

zypper -n in powerpc-utils-1.3.11-150500.3.14.3.ppc64le
ppc64_cpu --smt && ppc64_cpu --smt=on && ppc64_cpu --smt && ppc64_cpu --smt=off && ppc64_cpu --smt
SMT is off
SMT is off
SMT is off

So apparently the new version cannot switch SMT on when it is off and, as we already saw after the reboot, cannot switch it off when it is on.
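
The repeated round-trip check used in this comment can be wrapped into a function that also returns a proper exit status, which makes it easier to compare package versions. This is only a sketch; the function name and the PPC64_CPU override variable are made up for illustration, and the only tool options used are the ones already shown here (--smt, --smt=on, --smt=off).

```shell
# Hypothetical helper: switch SMT on, then off, print what the tool reports
# after each step and succeed only if SMT ends up off.
# PPC64_CPU is an assumed override to point at a different binary or a stub.
smt_toggle_check() {
    cmd=${PPC64_CPU:-ppc64_cpu}
    "$cmd" --smt=on >/dev/null 2>&1
    echo "after --smt=on:  $("$cmd" --smt 2>&1)"
    "$cmd" --smt=off >/dev/null 2>&1
    final=$("$cmd" --smt 2>&1)
    echo "after --smt=off: $final"
    [ "$final" = "SMT is off" ]
}

# Example use after installing a candidate package version:
#   smt_toggle_check && echo "this version can switch SMT off"
```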

I will create a proper lock on all our o3 and OSD ppc machines and then create a Bugzilla entry.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/763 for OSD. Merged right now to have the package lock effective before petrol+mania are affected. I downgraded the package on kerosene (o3) and on petrol+mania.

Actions #3

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 9 months ago

  • Description updated (diff)
  • Status changed from In Progress to Blocked
  • Priority changed from Urgent to Low
Actions #5

Updated by okurz 9 months ago · Edited

  • Status changed from Blocked to In Progress

Trying out a development version of powerpc-utils from https://build.opensuse.org/projects/hardware/packages/powerpc-utils/repositories/15.5/binaries as suggested by msuchanek. For this I disabled diesel from production again and removed the salt key.

Actions #6

Updated by okurz 9 months ago · Edited

powerpc-utils-1.3.12-lp155.263.1.ppc64le has no problems switching off SMT. So that version seems to be better than the affected 1.3.11-150500.3.14.3. I started telegraf, salt-minion and openQA worker instances again.

Actions #7

Updated by okurz 9 months ago

  • Status changed from In Progress to Blocked

There is now a maintenance submission with a fix, https://build.suse.de/request/show/325169. We are currently running the manually installed development version, which will likely be kept until a newer version is available from the repos. Blocking on https://bugzilla.suse.com/show_bug.cgi?id=1222163

Actions #8

Updated by okurz 8 months ago

  • Status changed from Blocked to New

https://bugzilla.suse.com/show_bug.cgi?id=1222163 is fixed. Time to ensure that the maintenance update, which has been released by now, is installed.

Actions #9

Updated by okurz 8 months ago · Edited

  • Due date set to 2024-05-05
  • Status changed from New to Feedback

sudo zypper se --details powerpc-utils shows that the package version with the fix is available, so I am removing the zypper lock from salt again: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/785 (merged)

sudo salt --no-color --state-output=changes \* cmd.run 'zypper rl powerpc-utils && rpm -q powerpc-utils && zypper -n in powerpc-utils' queue=True | grep -v 'Result: Clean - Started'

This also needed a bit more manual cleanup of locks for conflicting patches, followed by ensuring that the right version of the package powerpc-utils is installed. Then I triggered a reboot and am waiting until sudo salt -C 'G@osarch:ppc64le' cmd.run 'ppc64_cpu --smt' returns the expected "SMT is off" from all hosts.
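
The waiting step could be scripted by filtering the salt output for hosts that do not yet report "SMT is off". A minimal sketch, assuming the plain output layout shown in comment #2 (host name followed by a colon, then indented result lines); the function name hosts_with_smt_on is made up.

```shell
# Hypothetical helper: read `salt ... cmd.run 'ppc64_cpu --smt'` output from
# stdin and print every host whose result is anything other than "SMT is off".
# Hosts that did not return a proper result are listed as well.
hosts_with_smt_on() {
    awk '
        /^[^ \t].*:$/ { host = substr($0, 1, length($0) - 1); next }
        /^[ \t]/ && !/SMT is off/ && !(host in seen) { seen[host] = 1; print host }
    '
}

# Example use, polling once a minute until no host is left:
#   while [ -n "$(sudo salt --no-color -C 'G@osarch:ppc64le' cmd.run 'ppc64_cpu --smt' | hosts_with_smt_on)" ]; do
#       sleep 60
#   done
```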

Actions #10

Updated by okurz 8 months ago

  • Due date deleted (2024-05-05)
  • Status changed from Feedback to Resolved

Verified, should be good.
