action #68218
closed[opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot
0%
Description
test fails in hostname because "rcu_sched kthread starved" while VM snapshot
Observation¶
openQA test in scenario opensuse-Tumbleweed-DVD-ppc64le-create_hdd_gnome@ppc64le fails in
hostname
as per two attached logs, the "rcu_sched kthread starved"
is reported two times in same time frame as VM snapshots.
hostname module failed because qemu seems to be overloaded
and some typed characters are lost
typed: "hostnamectl set-hostname susetest"
shell: "hostn"
=== extract hostname-journal.log: (timestamp here is 4 hours before UTC)
Jun 18 08:19:38.525735 localhost.localdomain dbus-daemon[989]: [system] Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)
Jun 18 08:19:38.528317 localhost.localdomain pulseaudio[1687]: E: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Jun 18 08:19:38.531655 localhost.localdomain geoclue[1925]: Service not used for 60 seconds. Shutting down..
Jun 18 08:19:38.561998 localhost.localdomain systemd[1]: geoclue.service: Succeeded.
Jun 18 08:19:38.609279 localhost.localdomain kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun 18 08:19:38.609376 localhost.localdomain kernel: rcu: 0-...!: (1 ticks this GP) idle=fee/1/0x4000000000000002 softirq=39401/39401 fqs=0
Jun 18 08:19:38.609412 localhost.localdomain kernel: (t=8411 jiffies g=14885 q=719)
Jun 18 08:19:38.609438 localhost.localdomain kernel: rcu: rcu_sched kthread starved for 8411 jiffies! g14885 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
...
Jun 18 08:19:38.610159 localhost.localdomain kernel: [c0000000c6ba3e20] [c000000000009a34] decrementer_common_virt+0x174/0x180
Jun 18 08:19:38.609047 localhost.localdomain rtkit-daemon[1717]: The canary thread is apparently starving. Taking action.
...
Jun 18 08:23:52.048693 localhost.localdomain login[2469]: ROOT LOGIN ON hvc1
Jun 18 08:25:44.544376 localhost.localdomain kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun 18 08:25:44.596688 localhost.localdomain kernel: rcu: 0-...!: (1 ticks this GP) idle=5ca/0/0x1 softirq=49769/49769 fqs=0
Jun 18 08:25:44.596817 localhost.localdomain kernel: (t=10047 jiffies g=21097 q=978)
Jun 18 08:25:44.602524 localhost.localdomain kernel: rcu: rcu_sched kthread starved for 10047 jiffies! g21097 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
...
Jun 18 08:25:44.604829 localhost.localdomain kernel: [c000000001babf90] [c00000000000c49c] start_here_common+0x1c/0x400
Jun 18 08:25:44.543549 localhost.localdomain rtkit-daemon[1717]: The canary thread is apparently starving. Taking action.
===
=== extract autoinst-log.txt
[2020-06-18T12:18:14.456 UTC] [debug] Creating a VM snapshot lastgood
...
[2020-06-18T12:19:46.333 UTC] [debug] Migrating remaining bytes: 0
[2020-06-18T12:19:46.334 UTC] [debug] Snapshot complete
...
[2020-06-18T12:19:46.605 UTC] [debug] ||| starting system_prepare tests/console/system_prepare.pm
...
[2020-06-18T12:24:12.233 UTC] [debug] Creating a VM snapshot lastgood
...
[2020-06-18T12:25:49.587 UTC] [debug] Migrating remaining bytes: 0
[2020-06-18T12:25:49.588 UTC] [debug] Snapshot complete
...
[2020-06-18T12:25:49.643 UTC] [debug] ||| starting hostname tests/console/hostname.pm
[2020-06-18T12:25:49.644 UTC] [debug] tests/console/hostname.pm:32 called testapi::select_console
...
[2020-06-18T12:25:50.642 UTC] [debug] <<< testapi::type_string(string="hostnamectl set-hostname susetest", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
...
===
Test suite description¶
image creation job used as parent for other jobs testing based on existing installation. GUI-Installation and selecting the gnome desktop during installation.
Reproducible¶
Fails since (at least) Build 20200617
Expected result¶
Last good: 20200616 (or more recent)
Further details¶
Always latest result in this scenario: latest
Files
Updated by michel_mno over 4 years ago
- Related to action #44156: [opensuse][functional][u][sporadic][ppc64le] test fails in yast2_lan because rcu message in user-console added
Updated by michel_mno over 4 years ago
I added ref to issue#44156 for which we have a bypass with poo44156 needles.
BUT for this new issue#68218, there are rcu messages in root/user console, but also SUT overload because lost characters in typed commands of new module after VM restart.
This problem found on different tests:
- create_hdd_gnome failed hostname, but disappeared if manually retry (3 retries on 20200617) https://openqa.opensuse.org/tests/1306013#next_previous
- extra_tests_in_textmode, multiple failing modules because saved VM snapshot is reused multiple times. and test last more than 7hours when failing. https://openqa.opensuse.org/tests/1305506#next_previous
- gnome, failed at prepare_system_for_update_tests despite manual retries https://openqa.opensuse.org/tests/1306012#next_previous
Updated by michel_mno over 4 years ago
- Subject changed from test fails in hostname because "rcu_sched kthread starved" while VM snapshot to test fails because "rcu_sched kthread starved" while VM snapshot
- Priority changed from Normal to High
Updated by okurz over 4 years ago
- Subject changed from test fails because "rcu_sched kthread starved" while VM snapshot to [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot
This might be related to the worker machine running the VMs being overloaded. I checked right now, there are 6 openQA worker instances configured with currently 5 running tests, one idle. RAM usage is 27G/128G and 15m load average is 1.98 so currently the machine seems to have enough of headroom. You could try to conduct a statistical investigation to find out the fail rate and maybe distinguish between the case when many tests are running in parallel or only a few. I can provide the help to e.g. temporarily disable some worker instances in case you want to run some longer tests of only a single parallel VM as reference.
Updated by ggardet_arm over 4 years ago
You may want to enable system_workarounds.pm
test, as done for aarch64.
Updated by SLindoMansilla over 4 years ago
- Blocks action #64114: [functional][u][ppc64le] test fails in updates_packagekit_gpk "client did not authorize action" added
Updated by michel_mno over 4 years ago
ggardet_arm wrote:
You may want to enable
system_workarounds.pm
test, as done for aarch64.
Thank you Guillaume for the pointer,
I am verifying your suggestion with pending pr#10591
https://openqa.opensuse.org/tests/overview?build=michelmno%2Fos-autoinst-distri-opensuse%2310591&version=Tumbleweed&distri=opensuse
Updated by michel_mno over 4 years ago
results with sysem_workaround.pm is not sufficient, would need something else
Updated by michel_mno over 4 years ago
- Status changed from New to Resolved
It seems the problem disappeared by setting back QEMUCPUS & QEMUTHREADS to values > 1 as per results
of three last TW snapshots where no reported rcu warnings:
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200701&groupid=4
(that one only with ppc64le DVD tests restarted with QEMU* changes)
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200702&groupid=4
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200704&groupid=4
Changes done in https://openqa.opensuse.org/admin/machines for ppc64* machine definitions
Updated by okurz about 1 year ago
- Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added