Project

General

Profile

Actions

action #68218

closed

[opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot

Added by michel_mno almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
2020-06-18
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

test fails in hostname because "rcu_sched kthread starved" while VM snapshot

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-ppc64le-create_hdd_gnome@ppc64le fails in
hostname

as per two attached logs, the "rcu_sched kthread starved"
is reported two times in same time frame as VM snapshots.

hostname module failed because qemu seems to be overloaded
and some typed characters are lost
typed: "hostnamectl set-hostname susetest"
shell: "hostn"

=== extract hostname-journal.log: (timestamp here is 4 hours before UTC)
Jun 18 08:19:38.525735 localhost.localdomain dbus-daemon[989]: [system] Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)
Jun 18 08:19:38.528317 localhost.localdomain pulseaudio[1687]: E: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Jun 18 08:19:38.531655 localhost.localdomain geoclue[1925]: Service not used for 60 seconds. Shutting down..
Jun 18 08:19:38.561998 localhost.localdomain systemd[1]: geoclue.service: Succeeded.
Jun 18 08:19:38.609279 localhost.localdomain kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun 18 08:19:38.609376 localhost.localdomain kernel: rcu:         0-...!: (1 ticks this GP) idle=fee/1/0x4000000000000002 softirq=39401/39401 fqs=0
Jun 18 08:19:38.609412 localhost.localdomain kernel:         (t=8411 jiffies g=14885 q=719)
Jun 18 08:19:38.609438 localhost.localdomain kernel: rcu: rcu_sched kthread starved for 8411 jiffies! g14885 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
...
Jun 18 08:19:38.610159 localhost.localdomain kernel: [c0000000c6ba3e20] [c000000000009a34] decrementer_common_virt+0x174/0x180
Jun 18 08:19:38.609047 localhost.localdomain rtkit-daemon[1717]: The canary thread is apparently starving. Taking action.
...
Jun 18 08:23:52.048693 localhost.localdomain login[2469]: ROOT LOGIN ON hvc1
Jun 18 08:25:44.544376 localhost.localdomain kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Jun 18 08:25:44.596688 localhost.localdomain kernel: rcu:         0-...!: (1 ticks this GP) idle=5ca/0/0x1 softirq=49769/49769 fqs=0
Jun 18 08:25:44.596817 localhost.localdomain kernel:         (t=10047 jiffies g=21097 q=978)
Jun 18 08:25:44.602524 localhost.localdomain kernel: rcu: rcu_sched kthread starved for 10047 jiffies! g21097 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
...
Jun 18 08:25:44.604829 localhost.localdomain kernel: [c000000001babf90] [c00000000000c49c] start_here_common+0x1c/0x400
Jun 18 08:25:44.543549 localhost.localdomain rtkit-daemon[1717]: The canary thread is apparently starving. Taking action.
===
=== extract autoinst-log.txt
[2020-06-18T12:18:14.456 UTC] [debug] Creating a VM snapshot lastgood
...
[2020-06-18T12:19:46.333 UTC] [debug] Migrating remaining bytes:        0
[2020-06-18T12:19:46.334 UTC] [debug] Snapshot complete
...
[2020-06-18T12:19:46.605 UTC] [debug] ||| starting system_prepare tests/console/system_prepare.pm
...
[2020-06-18T12:24:12.233 UTC] [debug] Creating a VM snapshot lastgood
...
[2020-06-18T12:25:49.587 UTC] [debug] Migrating remaining bytes:        0
[2020-06-18T12:25:49.588 UTC] [debug] Snapshot complete
...
[2020-06-18T12:25:49.643 UTC] [debug] ||| starting hostname tests/console/hostname.pm
[2020-06-18T12:25:49.644 UTC] [debug] tests/console/hostname.pm:32 called testapi::select_console
...
[2020-06-18T12:25:50.642 UTC] [debug] <<< testapi::type_string(string="hostnamectl set-hostname susetest", max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
...
===

Test suite description

image creation job used as parent for other jobs testing based on existing installation. GUI-Installation and selecting the gnome desktop during installation.

Reproducible

Fails since (at least) Build 20200617

Expected result

Last good: 20200616 (or more recent)

Further details

Always latest result in this scenario: latest


Files

hostname-journal.log (174 KB) hostname-journal.log michel_mno, 2020-06-18 14:55
autoinst-log.txt (643 KB) autoinst-log.txt michel_mno, 2020-06-18 14:55

Related issues 3 (0 open3 closed)

Related to openQA Tests - action #44156: [opensuse][functional][u][sporadic][ppc64le] test fails in yast2_lan because rcu message in user-consoleRejectedzluo2018-11-21

Actions
Related to openQA Infrastructure - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:MResolvedokurz2023-09-20

Actions
Blocks openQA Tests - action #64114: [functional][u][ppc64le] test fails in updates_packagekit_gpk "client did not authorize action"RejectedSLindoMansilla2020-03-03

Actions
Actions #1

Updated by michel_mno almost 4 years ago

  • Related to action #44156: [opensuse][functional][u][sporadic][ppc64le] test fails in yast2_lan because rcu message in user-console added
Actions #2

Updated by michel_mno almost 4 years ago

I added ref to issue#44156 for which we have a bypass with poo44156 needles.

BUT for this new issue#68218, there are rcu messages in root/user console, but also SUT overload because lost characters in typed commands of new module after VM restart.

This problem found on different tests:

Actions #3

Updated by michel_mno almost 4 years ago

  • Subject changed from test fails in hostname because "rcu_sched kthread starved" while VM snapshot to test fails because "rcu_sched kthread starved" while VM snapshot
  • Priority changed from Normal to High
Actions #4

Updated by okurz almost 4 years ago

  • Subject changed from test fails because "rcu_sched kthread starved" while VM snapshot to [opensuse][ppc64le] test fails because "rcu_sched kthread starved" while VM snapshot

This might be related to the worker machine running the VMs being overloaded. I checked right now, there are 6 openQA worker instances configured with currently 5 running tests, one idle. RAM usage is 27G/128G and 15m load average is 1.98 so currently the machine seems to have enough of headroom. You could try to conduct a statistical investigation to find out the fail rate and maybe distinguish between the case when many tests are running in parallel or only a few. I can provide the help to e.g. temporarily disable some worker instances in case you want to run some longer tests of only a single parallel VM as reference.

Actions #5

Updated by ggardet_arm almost 4 years ago

You may want to enable system_workarounds.pm test, as done for aarch64.

Actions #6

Updated by SLindoMansilla almost 4 years ago

  • Blocks action #64114: [functional][u][ppc64le] test fails in updates_packagekit_gpk "client did not authorize action" added
Actions #7

Updated by michel_mno almost 4 years ago

ggardet_arm wrote:

You may want to enable system_workarounds.pm test, as done for aarch64.

Thank you Guillaume for the pointer,
I am verifying your suggestion with pending pr#10591
https://openqa.opensuse.org/tests/overview?build=michelmno%2Fos-autoinst-distri-opensuse%2310591&version=Tumbleweed&distri=opensuse

Actions #8

Updated by michel_mno almost 4 years ago

results with sysem_workaround.pm is not sufficient, would need something else

Actions #9

Updated by michel_mno almost 4 years ago

  • Status changed from New to Resolved

It seems the problem disappeared by setting back QEMUCPUS & QEMUTHREADS to values > 1 as per results
of three last TW snapshots where no reported rcu warnings:
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200701&groupid=4
(that one only with ppc64le DVD tests restarted with QEMU* changes)
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200702&groupid=4
https://openqa.opensuse.org/tests/overview?distri=opensuse&version=Tumbleweed&build=20200704&groupid=4

Changes done in https://openqa.opensuse.org/admin/machines for ppc64* machine definitions

Actions #10

Updated by okurz 5 months ago

  • Related to action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Actions

Also available in: Atom PDF