action #41882

all arm worker die after some time

Added by szarate over 1 year ago. Updated 6 days ago.

Status: Feedback
Start date: 02/10/2018
Priority: Low
Due date:
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Current Sprint
Duration:

Description

Machine dies after ~1 day. It seems like the kernel is stuck: no network available and SOL is just stuck (no agetty prompt).
Ticket for investigation from infra can be found here: LTBD

Screenshot_20200108_111458.png (74.6 KB) okurz, 08/01/2020 10:35 am


Related issues

Related to openQA Tests - action #50123: [functional][u][aarch64] Can't open 'os-autoinst.pid' for... Rejected 08/04/2019
Related to openQA Tests - action #50129: [functional][u] cannot remove directory for testresults: ... Rejected 08/04/2019
Related to openQA Project - action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd... Resolved 31/07/2019
Related to openQA Tests - action #25864: [tools][functional][u] stall detected in openqaworker-arm... Blocked 09/10/2017

History

#1 Updated by szarate over 1 year ago

  • Private changed from Yes to No

#2 Updated by coolo over 1 year ago

  • Priority changed from Normal to Urgent
  • Target version changed from Ready to Current Sprint

This host needs to be removed from the cluster (salt + dashboard) or brought back.

#3 Updated by coolo over 1 year ago

  • Project changed from openQA Project to openQA Infrastructure
  • Category deleted (168)

#4 Updated by nicksinger over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

#5 Updated by nicksinger over 1 year ago

First issue, and the reason we can't see anything after GRUB: the console needs to be changed to ttyAMA0.

#6 Updated by nicksinger over 1 year ago

  • Status changed from In Progress to Resolved

Hm… don't really know what should be wrong with the machine. After changing the console to ttyAMA0 I was able to completely boot the machine and finish the "saltification".
It shows up in openQA as well as grafana again. Setting to feedback until the first jobs have rolled in.

#7 Updated by nicksinger over 1 year ago

  • Subject changed from openqaworker-arm-3 seems dead to openqaworker-arm-3 dies after some time.
  • Status changed from Resolved to Workable

Yup, indeed: after one day of working properly it just self-destroyed again. eth0 is not pingable, sol activate is stuck. The BMC still seems to work fully.

#8 Updated by nicksinger over 1 year ago

  • Description updated (diff)
  • Status changed from Workable to Blocked

#10 Updated by mkittler 12 months ago

  • Subject changed from openqaworker-arm-3 dies after some time. to openqaworker-arm-3 and openqaworker-arm-1 die after some time

openqaworker-arm-1 is down again. I guess we need to extend this ticket to openqaworker-arm-1.

#11 Updated by SLindoMansilla 11 months ago

  • Related to action #50123: [functional][u][aarch64] Can't open 'os-autoinst.pid' for writing: 'Structure needs cleaning' at /usr/bin/isotovideo line 300 added

#12 Updated by SLindoMansilla 11 months ago

  • Related to action #50129: [functional][u] cannot remove directory for testresults: Directory not empty at /usr/lib/os-autoinst/bmwqemu.pm line 126 added

#13 Updated by nicksinger 10 months ago

  • Subject changed from openqaworker-arm-3 and openqaworker-arm-1 die after some time to all arm worker die after some time

By now it affects all of our arm workers.
I've enabled monitoring for them so we instantly notice when they are down, and I will try to collect any logging at all.
So far I have failed with these approaches:

  • Connect a monitor while the machine is "unresponsive" -> only a prompt, no kernel log traces seen
  • Connect a long-running SOL session to catch kernel messages -> only "ERROR" when looking at the machines once they crashed

My last idea is to enable the kernel's netconsole to gather at least some logs.
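A minimal sketch of how the netconsole module could be loaded at runtime; the addresses, ports and MAC below are purely hypothetical placeholders, the actual log target would still need to be chosen:

# format: netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
modprobe netconsole netconsole=6665@10.0.2.15/eth0,6666@10.0.2.1/00:11:22:33:44:55
# on the receiving host, collect the UDP stream, e.g.: nc -u -l 6666 | tee arm-kernel.log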

So far nobody has been able to help us without any logs. We still haven't found a reliable reproducer besides "letting the machines run long enough with openQA workload" :(

#14 Updated by okurz 7 months ago

any update in the Engineering Infrastructure ticket?

#15 Updated by okurz 6 months ago

  • Assignee changed from nicksinger to okurz

Actually I can access https://infra.nue.suse.com/SelfService/Display.html?id=123196 for some reason. It was resolved with "ask the ARM guys", so there is no need to wait further for the ticket. The symptoms look similar to what I am already tracking in #53471, so I guess you don't mind if I pick up this ticket as well and block it by #53471.

#16 Updated by okurz 6 months ago

  • Related to action #56447: openqaworker-arm-2 is out-of-space on /was: openQA on osd fails with empty logs added

#17 Updated by okurz 5 months ago

  • Status changed from Blocked to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/172 as a workaround, https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/173 for debugging. nsinger is also working on an alternative to https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/172 that detects from grafana when the workers die and triggers a reboot from there.

EDIT: The automatic detection of workers being down is done as in https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?tab=alert&orgId=1&fullscreen&edit&panelId=3 and the power cycling is triggered in https://gitlab.suse.de/nicksinger/grafana-webhook-actions/-/jobs . As this has worked ok-ish over the past days, nsinger plans to move it to the "openqa/…" project scope. Suggestions for improvement are in https://gitlab.suse.de/nicksinger/grafana-webhook-actions/merge_requests/1

Today, in the morning of 2019-09-27, arm3 is still inoperable from yesterday as services failed to start. As a workaround for this I plan to apply proper systemd based "restart" policies. Then arm2 went down and was not restarted; I triggered the power cycle manually. Then arm1 went down. I wonder if too high load causes this: because the other two machines arm2 and arm3 were basically unusable, arm1 might have been hit by a higher workload. https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashoard-openqaworker-arm-1?orgId=1&from=1569565893750&to=1569566741861&fullscreen&panelId=54694 shows that the load increased before the host went down. https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-2/worker-dashoard-openqaworker-arm-2?orgId=1&from=1569561977310&to=1569565306365&fullscreen&panelId=54694 shows the same for arm2, but it does not seem to be "I/O wait", which would point to problems with the storage.
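A rough sketch of such a systemd restart policy as a drop-in for the openQA worker instances; the drop-in name and values here are assumptions, not taken from the actual salt states:

sudo mkdir -p /etc/systemd/system/openqa-worker@.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/openqa-worker@.service.d/restart.conf
[Service]
# restart a failed worker instance automatically, with a short back-off
Restart=on-failure
RestartSec=10
EOF
sudo systemctl daemon-reload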

EDIT: 2019-10-01: nsinger moved gitlab repo to https://gitlab.suse.de/openqa/grafana-webhook-actions now
EDIT: 2019-10-02: Found on openqaworker-arm-1 during bootup:

[ SKIP ] Ordering cycle found, skipping Update is Completed
[ SKIP ] Ordering cycle found, skipping Rebuild Journal Catalog
[ SKIP ] Ordering cycle found, skipping Watch for changes in CA certificates
[ SKIP ] Ordering cycle found, skipping Tell Plymouth To Write Out Runtime Data
[ SKIP ] Ordering cycle found, skipping Update UTMP about System Boot/Shutdown
[ SKIP ] Ordering cycle found, skipping Security Auditing Service
[ SKIP ] Ordering cycle found, skipping Local File Systems

I suspect the side-grade/upgrade to have caused this. Probably I can clean up some unused packages and services first before trying to find cycles, e.g.:

salt '*arm*' cmd.run 'zypper --no-refresh -n rm -u cups cups-client cups-config cups-filters libcups2 libcupscgi1 libcupsimage2 libcupsmime1 libcupsppdc1'

Rebooted arm1 and:

[ SKIP ] Ordering cycle found, skipping D-Bus System Message Bus Socket
[ SKIP ] Ordering cycle found, skipping Security Auditing Service
[ SKIP ] Ordering cycle found, skipping Tell Plymouth To Write Out Runtime Data
[ SKIP ] Ordering cycle found, skipping Local File Systems

Oh, and "[FAILED] Failed to start wicked managed network interfaces.", so the network is busted ;D

Another reboot, and the only problem I see is

[ SKIP ] Ordering cycle found, skipping /var/lib/openqa/nvme/pool

and the network is fine. The above is troubling though: I checked, and the directory is actually not mounted. Fixed temporarily with mount -a.

journalctl -b shows:

Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found ordering cycle on local-fs.target/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on var-lib-openqa-nvme-pool.mount/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on var-lib-openqa-nvme.mount/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on openqa_nvme_format.service/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on basic.target/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on paths.target/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on btrfsmaintenance-refresh.path/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on sysinit.target/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Found dependency on local-fs.target/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: local-fs.target: Breaking ordering cycle by deleting job var-lib-openqa-nvme-pool.mount/start
Oct 04 18:28:19 openqaworker-arm-1 systemd[1]: var-lib-openqa-nvme-pool.mount: Job var-lib-openqa-nvme-pool.mount/start deleted to break ordering cycle starting with local-fs.target/start

maybe https://unix.stackexchange.com/questions/193714/generic-methodology-to-debug-ordering-cycles-in-systemd can help.

I checked on which machines we have cycles with

salt '*' cmd.run 'journalctl | grep "ordering cycle"'

and I found them on all of arm1…3, but only there, so I suspect the nvme prepare service rather than the SLE-to-Leap sidegrade is the culprit.

On arm3:

sudo systemctl mask --now salt-minion openqa-worker.target
sudo vim /etc/systemd/system/openqa_nvme_format.service

Added DefaultDependencies=no as per https://bugzilla.suse.com/show_bug.cgi?id=1149603#c10 and ran sudo sync && sudo systemctl kexec. Still a cycle, so I added DefaultDependencies=no to /etc/systemd/system/var-lib-openqa-cache.mount.d/override.conf and tried again. The line was also needed in /etc/systemd/system/openqa_nvme_prepare.service and /etc/systemd/system/var-lib-openqa-pool.mount.d/override.conf. Rebooted, and sudo journalctl -b | grep -i '\(ordering cycle\|Found dependency\)' does not report any new ones.
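For illustration, the added line is just a [Unit] setting in those files; a minimal sketch (the remaining contents of the real unit and override files are not shown in this ticket):

# e.g. in /etc/systemd/system/var-lib-openqa-cache.mount.d/override.conf
# (analogously for the other units listed above)
[Unit]
# drop the implicit sysinit/basic dependencies that close the ordering cycle
DefaultDependencies=no

# afterwards: sudo systemctl daemon-reload (or reboot) so systemd picks up the change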

Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192

#18 Updated by okurz 5 months ago

  • Priority changed from Urgent to High

As we now have the automatic restart actions in https://gitlab.suse.de/openqa/grafana-webhook-actions, the priority is not "Urgent" anymore, but it stays "High" because even with automatic restarts services within the machines can fail to start up.

#19 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

With the ordering cycles fixed, and all three arm workers up and automatically rebooted based on grafana alerts and gitlab CI actions, I consider this done.

#20 Updated by okurz 3 months ago

  • Status changed from Resolved to Feedback

#21 Updated by okurz 3 months ago

According to slindomansilla the reduction of workers is not applied automatically; he did that manually after salt was applied. I thought we had that handled automatically. https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=now-30d&to=now shows that arm-2 is still very unstable, but it could be that arm-1 improved; too early to say. I tried out "sensors" on arm-2 but the machine does not provide any sensors that are readable by the OS. I am now running

while sleep 30; do (date -Is; ipmitool -I lanplus -H openqaworker-arm-2-ipmi.suse.de -U ADMIN -P ADMIN sensor) | tee -a arm-2_sensors-$(date -I).log; done

on lord.arch, saving to /local/. The next time the machine dies we can compare values and see if there are e.g. voltage drops or temperature problems.

EDIT: 2019-12-11: Again the machine died as visible in https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=1576072244796&to=1576081908796&fullscreen&panelId=3 . What I can see in the sensor output recorded as described above:

Idle time in the morning vs. later the same day under load:

2019-12-11T06:43:25+01:00
CPU0_TEMP        | 38.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 85.000    | 88.000    | na        
CPU1_TEMP        | 42.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 85.000    | 88.000    | na        
VR_P0_IOUT       | 54.000     | Amps       | ok    | na        | na        | na        | 228.000   | 248.000   | na        
VR_P1_IOUT       | 58.000     | Amps       | ok    | na        | na        | na        | 228.000   | 248.000   | na        
BPB_FAN_1A       | 7400.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_2A       | 7300.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_3A       | 7400.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_4A       | 7500.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
MB_TEMP1         | 31.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 75.000    | 80.000    | na        
2019-12-11T15:12:13+01:00
CPU0_TEMP        | 41.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 85.000    | 88.000    | na        
CPU1_TEMP        | 45.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 85.000    | 88.000    | na        
VR_P0_IOUT       | 67.000     | Amps       | ok    | na        | na        | na        | 228.000   | 248.000   | na        
VR_P1_IOUT       | 71.000     | Amps       | ok    | na        | na        | na        | 228.000   | 248.000   | na        
BPB_FAN_1A       | 7400.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_2A       | 7300.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_3A       | 7400.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
BPB_FAN_4A       | 7500.000   | RPM        | ok    | na        | 800.000   | 1000.000  | na        | na        | na        
MB_TEMP1         | 32.000     | degrees C  | ok    | na        | 0.000     | 5.000     | 75.000    | 80.000    | na        

So clearly the CPU temperature and the drawn current increased at a time when the worker was receiving higher load; that doesn't mean the machine cannot handle this, though. However, I will try to reduce the workers further. Did sudo systemctl mask --now openqa-worker@{5..20} on arm-2. Let's see if it crashes again.

EDIT: 2019-12-11: the management interface of openqaworker-arm-3 appeared again. I powered the machine off, waited, powered on, waited 10m and I could ssh. Also the SOL showed a login prompt after disconnecting/reconnecting, although no information during bootup. I updated the machine, accepted the salt key on osd and applied the high state. Should be good to go again. Actually I think that, besides the major one-month outage, the machine was OK and not as problematic as arm-1 and arm-2, see https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?from=now-90d&to=now . Updated pillar data again to use the full number of workers on arm-3. So currently we have 20 on arm-1, 4 on arm-2, 30 on arm-3.

#22 Updated by okurz about 1 month ago

So I just realized that arm-1 fails to respond to a salt call in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/156074 . I could log in to the machine, though, and see the high load:

# cat /proc/loadavg 
302.47 302.64 302.70 6/1675 40425

dmesg output:

Jan 08 06:18:02 openqaworker-arm-1 kernel: BUG: workqueue lockup - pool cpus=26 node=0 flags=0x0 nice=0 stuck for 11857s!
Jan 08 06:18:02 openqaworker-arm-1 kernel: Showing busy workqueues and worker pools:
Jan 08 06:18:02 openqaworker-arm-1 kernel: workqueue mm_percpu_wq: flags=0x8
Jan 08 06:18:02 openqaworker-arm-1 kernel:   pwq 52: cpus=26 node=0 flags=0x0 nice=0 active=1/256
Jan 08 06:18:02 openqaworker-arm-1 kernel:     pending: lru_add_drain_per_cpu BAR(39133)
Jan 08 06:18:02 openqaworker-arm-1 kernel: workqueue writeback: flags=0x4e
Jan 08 06:18:02 openqaworker-arm-1 kernel:   pwq 96: cpus=0-47 flags=0x4 nice=0 active=1/256
Jan 08 06:18:02 openqaworker-arm-1 kernel:     in-flight: 5593:wb_workfn
Jan 08 06:18:02 openqaworker-arm-1 kernel: workqueue kblockd: flags=0x18
Jan 08 06:18:02 openqaworker-arm-1 kernel:   pwq 17: cpus=8 node=0 flags=0x0 nice=-20 active=1/256
Jan 08 06:18:02 openqaworker-arm-1 kernel:     in-flight: 0:blk_mq_timeout_work
Jan 08 06:18:02 openqaworker-arm-1 kernel:   pwq 7: cpus=3 node=0 flags=0x0 nice=-20 active=2/256
Jan 08 06:18:02 openqaworker-arm-1 kernel:     in-flight: 20871:blk_mq_timeout_work blk_mq_timeout_work
Jan 08 06:18:02 openqaworker-arm-1 kernel: pool 7: cpus=3 node=0 flags=0x0 nice=-20 hung=11816s workers=3 idle: 20907 30
Jan 08 06:18:02 openqaworker-arm-1 kernel: pool 17: cpus=8 node=0 flags=0x0 nice=-20 hung=12965s workers=3 idle: 29568 60
Jan 08 06:18:02 openqaworker-arm-1 kernel: pool 96: cpus=0-47 flags=0x4 nice=0 hung=0s workers=6 idle: 40179 40011 40204 40261
Jan 08 06:18:06 openqaworker-arm-1 kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 08 06:18:06 openqaworker-arm-1 kernel:         26-...: (1 GPs behind) idle=23e/140000000000000/0 softirq=443547/443548 fqs=1125189 
Jan 08 06:18:06 openqaworker-arm-1 kernel:         (detected by 25, t=2670743 jiffies, g=270909, c=270908, q=14756827)
Jan 08 06:18:06 openqaworker-arm-1 kernel: Task dump for CPU 26:
Jan 08 06:18:06 openqaworker-arm-1 kernel: lspci           R  running task        0 30539   7214 0x00000002
Jan 08 06:18:06 openqaworker-arm-1 kernel: Call trace:
Jan 08 06:18:06 openqaworker-arm-1 kernel:  ret_from_fork+0x0/0x20
Message from syslogd@openqaworker-arm-1 at Jan  8 06:18:33 ...
 kernel:[44878.886279] BUG: workqueue lockup - pool cpus=26 node=0 flags=0x0 nice=0 stuck for 11888s!

on arm-1 .

There seem to be many processes stuck on I/O and many zombies:

$ tail -n 40 ps_auxf_arm_1-2020-01-08 
_openqa+ 48094  100  0.2 2035544 617788 ?      Sl   Jan07 588:50          \_ /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -m 1024 -machine virt,usb=off,gic-version=3,its=off -cpu host -netdev tap,id=qanet0,ifname=tap13,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:12:01:e7 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :104,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/14/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/14/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/14/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/14/raid/pflash-vars-overlay0,unit=1
_openqa+ 15572  0.0  0.0  77520 71360 ?        Ds   Jan07   0:23 /usr/bin/perl /usr/share/openqa/script/worker --instance 15
_openqa+ 41949  0.1  0.0 231784 222240 ?       S    Jan07   0:59  \_ perl /usr/bin/isotovideo -d
_openqa+ 41954  0.0  0.0  70744 53040 ?        S    Jan07   0:04      \_ /usr/bin/isotovideo: comman
_openqa+ 41976  0.1  0.0 1019964 183980 ?      D    Jan07   0:47      \_ /usr/bin/isotovideo: autote
_openqa+ 41977  5.9  0.1 4562084 457236 ?      Dl   Jan07  36:13      \_ /usr/bin/isotovideo: backen
_openqa+ 41978  1.2  0.0 882628 56736 ?        SN   Jan07   7:19          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/nvme/pool/15/video.ogv
_openqa+ 42013 15.5  0.5 6437164 1516604 ?     Sl   Jan07  94:16          \_ /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -m 1024 -machine virt,usb=off,gic-version=3,its=off -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :105,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/15/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/15/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,bootindex=0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/15/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/15/raid/pflash-vars-overlay0,unit=1
_openqa+ 15583  0.0  0.0  96684 90544 ?        Ds   Jan07   0:28 /usr/bin/perl /usr/share/openqa/script/worker --instance 16
_openqa+ 13518  0.0  0.0 155152 141984 ?       S    Jan07   0:14  \_ perl /usr/bin/isotovideo -d
_openqa+ 13526  0.0  0.0  70496 52724 ?        S    Jan07   0:03      \_ /usr/bin/isotovideo: comman
_openqa+ 13594  0.0  0.0 960300 135768 ?       S    Jan07   0:05      \_ /usr/bin/isotovideo: autote
_openqa+ 13597  0.5  0.2 4870016 765108 ?      Dl   Jan07   3:16      \_ /usr/bin/isotovideo: backen
_openqa+ 13598  0.8  0.0 882776 56192 ?        DN   Jan07   4:31          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/nvme/pool/16/video.ogv
_openqa+ 13632  3.1  0.5 6236572 1397708 ?     Sl   Jan07  17:09          \_ /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -m 1024 -machine virt,usb=off,gic-version=3,its=off -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :106,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/16/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/16/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/16/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/16/raid/pflash-vars-overlay0,unit=1
_openqa+ 15596  0.0  0.0  71604 65808 ?        Ss   Jan07   0:24 /usr/bin/perl /usr/share/openqa/script/worker --instance 17
_openqa+ 29067  0.0  0.0 182208 170704 ?       D    Jan07   0:16  \_ perl /usr/bin/isotovideo -d
_openqa+ 29078  0.0  0.0      0     0 ?        Z    Jan07   0:01      \_ [/usr/bin/isotov] <defunct>
_openqa+ 29103  0.0  0.0 976780 150636 ?       D    Jan07   0:15      \_ /usr/bin/isotovideo: autote
_openqa+ 29105  1.2  0.1 4467172 370808 ?      Dl   Jan07   7:57      \_ /usr/bin/isotovideo: backen
_openqa+ 29106  1.7  0.0      0     0 ?        ZN   Jan07  11:18          \_ [videoencoder] <defunct>
_openqa+ 29165  0.5  0.0      0     0 ?        Zl   Jan07   3:37          \_ [qemu-system-aar] <defunct>
_openqa+ 15607  0.0  0.0 116076 110412 ?       Ds   Jan07   0:34 /usr/bin/perl /usr/share/openqa/script/worker --instance 18
_openqa+ 19435  0.0  0.0 155976 142204 ?       S    Jan07   0:09  \_ perl /usr/bin/isotovideo -d
_openqa+ 19441  0.0  0.0  70096 52344 ?        S    Jan07   0:03      \_ /usr/bin/isotovideo: comman
_openqa+ 19473  0.0  0.0 964156 136768 ?       S    Jan07   0:01      \_ /usr/bin/isotovideo: autote
_openqa+ 19475  0.3  0.3 4902720 800988 ?      Dl   Jan07   2:04      \_ /usr/bin/isotovideo: backen
_openqa+ 19476  0.0  0.0 882452 57476 ?        DN   Jan07   0:29          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/nvme/pool/18/video.ogv
_openqa+ 19513  1.5  0.4 2260116 1072312 ?     Sl   Jan07   8:17          \_ /usr/bin/qemu-system-aarch64 -device virtio-gpu-pci -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -soundhw hda -m 1024 -machine virt,usb=off,gic-version=3,its=off -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 2 -enable-kvm -no-shutdown -vnc :108,share=force-shared -chardev socket,path=qmp_socket,server,nowait,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/18/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/nvme/pool/18/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/18/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/nvme/pool/18/raid/pflash-vars-overlay0,unit=1
_openqa+ 15620  0.0  0.0  72524 66696 ?        Ss   Jan07   0:19 /usr/bin/perl /usr/share/openqa/script/worker --instance 19
_openqa+ 29657  0.5  0.0      0     0 ?        Zl   Jan07   3:29  \_ [qemu-system-aar] <defunct>
root     15683  0.9  0.0 6326216 139284 ?      Ssl  Jan07   6:15 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
root     21512  0.0  0.0 163252  5272 ?        DNs  Jan07   0:00 (coredump)
root     23247  0.0  0.0 163252  5272 ?        DNs  Jan07   0:00 (coredump)
root     23249  0.0  0.0 163252  5272 ?        DNs  Jan07   0:00 (coredump)
root     30466  0.0  0.0 1006652 62748 ?       Dl   Jan07   0:04 /usr/bin/python3 /usr/bin/salt-minion
root     39124  0.0  0.0  36416 20016 ?        Ss   03:00   0:00 /usr/bin/zypper -n patch --replacefiles --auto-agree-with-licenses --force-resolution --download-in-advance
root     39133  0.0  0.0   8424  1380 ?        D    03:00   0:00 /usr/bin/gpg2 --version
okurz    40270  0.2  0.0  14388  7148 ?        Ss   06:16   0:00 /usr/lib/systemd/systemd --user
okurz    40271  0.0  0.0 168576  3088 ?        S    06:16   0:00  \_ (sd-pam)

After three more hours the system does not react to ping anymore and ssh does not work, but SOL is still active, albeit slow to log in. It seems many processes are stuck in I/O wait; https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashoard-openqaworker-arm-1?orgId=1&refresh=1m&from=now-24h&to=now reflects this as well. https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashoard-openqaworker-arm-1?orgId=1&from=1578402000000&to=1578474000000

dd bs=1M if=/dev/urandom of=test.img count=10240

gets blocked almost immediately.

root     41857  0.0  0.0   3452  2320 pts/2    D+   09:41   0:00              |   \_ dd bs=1M if=/dev/urandom of=test.img count=10240
root     41884  0.0  0.0   3096   488 pts/3    S+   09:41   0:00                  \_ grep --color=auto dd

and

openqaworker-arm-1:~ # cat /proc/41857/stack
[<ffff000008087f14>] __switch_to+0xe4/0x150
[<ffff00000810ff08>] io_schedule+0x20/0x40
[<ffff00000823b0f0>] __lock_page+0xf0/0x120
[<ffff00000823c610>] pagecache_get_page+0x1d0/0x270
[<ffff0000083a66c0>] ext4_mb_init_group+0x90/0x278
[<ffff0000083a6a60>] ext4_mb_good_group+0x1b8/0x1c8
[<ffff0000083aa1d4>] ext4_mb_regular_allocator+0x18c/0x4c0
[<ffff0000083ac528>] ext4_mb_ne/0x50
[<ffff000008395e98>] ext4_block_write_begin+0x138/0x490
[<ffff00000839e430>] ext4_write_begin+0x160/0x550
[<ffff00000823bf08>] generic_perform_write+0x98/0x188
[<ffff00000823d9d0>] __generic_file_write_iter+0x158/0x1c8
[<ffff00000838a41c>] ext4_file_write_iter+0xa4/0x388
[<ffff0000082d7ff8>] __vfs_write+0xd0/0x148
[<ffff0000082d9774>] vfs_write+0xac/0x1b8
[<ffff0000082dae3c>] SyS_write+0x54/0xb0
[<ffff000008083d30>] el0_svc_naked+0x44/0x48
[<ffffffffffffffff>] 0xffffffffffffffff

Checking the NVMe state turned out "live", e.g. with cat /sys/class/nvme/nvme0/state. We tried to reset it, causing:

cat /sys/class/nvme/nvme0/state 
resetting

but the reset never completes.
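The exact reset command is not recorded here; as an assumption, such a reset is typically triggered via the controller's sysfs attribute, roughly:

echo 1 > /sys/class/nvme/nvme0/reset_controller
cat /sys/class/nvme/nvme0/state   # should cycle through "resetting" and end up "live" again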

#23 Updated by okurz about 1 month ago

Screenshot of the grafana dashboard showing the maxed-out load and I/O wait:

Screenshot_20200108_111458.png

I am not yet sure if this is a reproducible problem or the real issue we are looking for.

#24 Updated by okurz about 1 month ago

okurz wrote:

Updated pillar data again to use the full numbers of worker on arm-3. So currently we have 20 on arm-1, 4 on arm-2, 30 on arm-3.

https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?from=1577833200000&to=1578610799000 shows 3 reboots for arm-1 and arm-3 but only 1 for arm-2 with the reduced set of worker instances. It could be that the reduced number of worker instances helps. But the salt state apply fails as in https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/156742 because of the masked workers. We should use salt pillar data instead of masking workers. So I unmasked and enabled openQA worker instances {5..30} on arm-2 again.

https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/206 created and merged to bump arm-2 again to 30 but arm-3 down to 4 to crosscheck.

#25 Updated by okurz about 1 month ago

Added a comment on https://trello.com/c/JQtnALhz/6-openqa-hw-budget-planning#comment-5e185a3e9a5c3786c32fd089 kindly requesting budget for new hardware.

#26 Updated by okurz about 1 month ago

  • Related to action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added

#27 Updated by okurz about 1 month ago

okurz wrote:

[…]
https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/206 created and merged to bump arm-2 again to 30 but arm-3 down to 4 to crosscheck.

I forgot to manually disable the services when reducing the number of openQA worker instances. https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1&from=1579548585485&to=1579603178052 shows that arm-3 failed again, but there are still 30 worker instances:

sudo systemctl mask --now openqa-worker@{5..30}

#28 Updated by okurz about 1 month ago

Today openqaworker-arm-1 acted up. The osd-deployment in https://gitlab.suse.de/openqa/osd-deployment/-/jobs/161879 didn't finish because arm-1 isn't/wasn't responsive. So I added a retry to the gitlab CI pipeline with https://gitlab.suse.de/openqa/osd-deployment/commit/3af7d93b3ddfd61d4965dae426d3b4d06f1212e3 , but still to no avail so far.

The machine is pingable and one can log in over ssh, but it does not react in time to salt. Once logged in, one is greeted with
"BUG: workqueue lockup - pool cpus=15 node=0 flags=0x0 nice=0 stuck for …"

Given the number of bugs I found with http://fastzilla.suse.de/?q=workqueue+lockup I wonder if I should open another one; probably not. Looked into this with Santi, who was in the room, and he suggested doing https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/7771/files#diff-0c79f77e8ce8ef2a80d9311b5f67041dR27-R28 so we have stack traces on the console, yay. That is:

sysctl -w kernel.softlockup_panic=1
sysctl -p
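To keep the setting across reboots it could additionally be persisted, for example like this (the file name is an arbitrary choice, not taken from the salt states):

echo 'kernel.softlockup_panic = 1' > /etc/sysctl.d/99-softlockup-panic.conf
sysctl --system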

With https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/217 I propose to reduce the number of worker instances on arm-1 for the time being as well.

example message with stack trace from journalctl -kf:

Jan 22 10:08:13 openqaworker-arm-1 kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 22 10:08:13 openqaworker-arm-1 kernel:         23-...: (1 GPs behind) idle=9ca/140000000000000/0 softirq=420584/420588 fqs=724371 
Jan 22 10:08:13 openqaworker-arm-1 kernel:         (detected by 8, t=1896528 jiffies, g=400737, c=400736, q=12082145)
Jan 22 10:08:13 openqaworker-arm-1 kernel: Task dump for CPU 23:
Jan 22 10:08:13 openqaworker-arm-1 kernel: lspci           R  running task        0  3725   8151 0x00000002
Jan 22 10:08:13 openqaworker-arm-1 kernel: Call trace:
Jan 22 10:08:13 openqaworker-arm-1 kernel:  ret_from_fork+0x0/0x20
Jan 22 10:08:27 openqaworker-arm-1 kernel: BUG: workqueue lockup - pool cpus=15 node=0 flags=0x0 nice=0 stuck for 7575s!
Jan 22 10:08:27 openqaworker-arm-1 kernel: Showing busy workqueues and worker pools:
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue events: flags=0x0
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 30: cpus=15 node=0 flags=0x0 nice=0 active=2/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 322:wait_rcu_exp_gp
Jan 22 10:08:27 openqaworker-arm-1 kernel:     pending: cache_reap
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue mm_percpu_wq: flags=0x8
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 30: cpus=15 node=0 flags=0x0 nice=0 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     pending: vmstat_update
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue writeback: flags=0x4e
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 96: cpus=0-47 flags=0x4 nice=0 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 43459:wb_workfn
Jan 22 10:08:27 openqaworker-arm-1 kernel: workqueue kblockd: flags=0x18
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 65: cpus=32 node=0 flags=0x0 nice=-20 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 0:blk_mq_timeout_work
Jan 22 10:08:27 openqaworker-arm-1 kernel:   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
Jan 22 10:08:27 openqaworker-arm-1 kernel:     in-flight: 0:blk_mq_timeout_work
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 3: cpus=1 node=0 flags=0x0 nice=-20 hung=331s workers=3 idle: 46484 18
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 30: cpus=15 node=0 flags=0x0 nice=0 hung=7575s workers=2 idle: 101
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 65: cpus=32 node=0 flags=0x0 nice=-20 hung=3138s workers=3 idle: 1204 204
Jan 22 10:08:27 openqaworker-arm-1 kernel: pool 96: cpus=0-47 flags=0x4 nice=0 hung=0s workers=7 idle: 13255 12971 9864 14174 13163 14060

sometimes there are processes mentioned, as in the above example of lspci:

lspci           R  running task        0  3725   8151 0x00000002

that's lspci with PID 3725 which is triggered by salt-minion:

root      3150  0.0  0.0  39532 28028 ?        Ss   Jan21   0:01 /usr/bin/python3 /usr/bin/salt-minion
root      8151  0.1  0.0 998512 71104 ?        Sl   Jan21   0:58  \_ /usr/bin/python3 /usr/bin/salt-minion
root      8814  0.0  0.0 123024 28572 ?        S    Jan21   0:00      \_ /usr/bin/python3 /usr/bin/salt-minion
root      3725  0.0  0.0   3476   900 ?        R    04:52   0:00      \_ /sbin/lspci -vmm

I tried to kill lspci but that did not seem to help. Now the machine finally seems to be dead. I power cycled it as it does not even react to magic sysrq over SOL anymore. Next time we should follow https://bugzilla.suse.com/show_bug.cgi?id=1106840#c2 , i.e. trigger magic-sysrq-t and record the complete log for a bug report; a sketch of that follows below.
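A rough sketch of that capture, assuming sysrq is enabled on the target and the SOL session is attached via ipmitool; the host name, credentials and the ipmitool break escape are assumptions based on the commands used elsewhere in this ticket:

# on the worker, while it still responds: allow all sysrq functions
echo 1 > /proc/sys/kernel/sysrq
# from an admin host: attach to the serial console
ipmitool -I lanplus -H openqaworker-arm-1-ipmi.suse.de -U ADMIN -P ADMIN sol activate
# inside the SOL session: send a serial break with '~B', then press 't' within a few seconds to dump all task states
# afterwards, if a shell is still reachable, save the kernel log for the bug report:
dmesg -T > /tmp/dmesg_$(date +%F)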

EDIT: 2020-02-04: Found again openqaworker-arm-1 with a similar problem, again lspci:

[Tue Feb  4 10:18:34 2020] SFW2-INext-ACC-TCP IN=eth0 OUT= MAC=1c:1b:0d:68:7e:c7:00:de:fb:e3:d7:7c:08:00 SRC=10.163.1.98 DST=10.160.0.245 LEN=60 TOS=0x00 PREC=0x00 TTL=62 ID=62479 DF PROTO=TCP SPT=32952 DPT=22 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (0204051C0402080A1D51F8FA0000000001030307) 
[Tue Feb  4 10:19:46 2020] INFO: rcu_sched detected stalls on CPUs/tasks:
[Tue Feb  4 10:19:46 2020]      2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=14344932 
[Tue Feb  4 10:19:46 2020]      (detected by 43, t=34503583 jiffies, g=6715213, c=6715212, q=231673150)
[Tue Feb  4 10:19:46 2020] Task dump for CPU 2:
[Tue Feb  4 10:19:46 2020] lspci           R  running task        0 26073   7344 0x00000002
[Tue Feb  4 10:19:46 2020] Call trace:
[Tue Feb  4 10:19:46 2020]  __switch_to+0xe4/0x150
[Tue Feb  4 10:19:46 2020]  0xffffffffffffffff

I triggered a complete stack trace of all processes with echo t > /proc/sysrq-trigger and saved the output with dmesg -T > /tmp/dmesg_$(date +%F).

The dmesg log contains the complete dmesg output from the current boot; the first suspicious entry:

[584684.843935] nvme nvme0: I/O 328 QID 26 timeout, completion polled
[585160.113773] nvme nvme0: I/O 985 QID 4 timeout, completion polled
[585160.113804] nvme nvme0: I/O 9 QID 23 timeout, completion polled
[591194.945207] nvme nvme0: I/O 209 QID 25 timeout, completion polled
[597368.812505] Synchronous External Abort: synchronous external abort (0x96000210) at 0xffff000018be201c
[597368.812513] Internal error: : 96000210 [#1] SMP
[597368.817161] Modules linked in: nf_log_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_TCPMSS nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_nat nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfnetlink_cthelper nfnetlink nfs lockd grace sunrpc fscache scsi_transport_iscsi af_packet tun ip_gre gre ip_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_physdev br_netfilter bridge stp llc xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 vfat fat nicvf cavium_ptp nicpf joydev thunder_bgx mdio_thunder mdio_cavium thunderx_edac thunder_xcv
[597368.888273]  cavium_rng_vf cavium_rng uio_pdrv_genirq aes_ce_blk crypto_simd uio cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 ipmi_ssif ipmi_devintf ipmi_msghandler sha2_ce sha256_arm64 sha1_ce btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq hid_generic usbhid ast xhci_pci i2c_algo_bit xhci_hcd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm nvme nvme_core gpio_keys drm_panel_orientation_quirks usbcore thunderx_mmc i2c_thunderx i2c_smbus mmc_core dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[597368.942048] CPU: 3 PID: 480 Comm: kworker/3:1H Not tainted 4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1
[597368.951977] Hardware name: GIGABYTE R120-T32/MT30-GS1, BIOS T19 09/29/2016
[597368.958963] Workqueue: kblockd blk_mq_timeout_work
[597368.963850] task: ffff803ee7152000 task.stack: ffff803e5fae8000
[597368.969863] pstate: 60000005 (nZCv daif -PAN -UAO)
[597368.974756] pc : nvme_timeout+0x48/0x340 [nvme]
[597368.979382] lr : blk_mq_check_expired+0x140/0x178
[597368.984180] sp : ffff803e5faebc20
[597368.987587] x29: ffff803e5faebc20 x28: ffff803e2c05b800 
[597368.992995] x27: 000000000000000e x26: 0000000000000000 
[597368.998405] x25: ffff803eedb73810 x24: ffff803e22950000 
[597369.003813] x23: ffff803e22950138 x22: ffff803e25ab1380 
[597369.009223] x21: ffff803ee735c000 x20: ffff000009339000 
[597369.014634] x19: ffff803e22950000 x18: 0000ffff7c2ee000 
[597369.020042] x17: 0000ffff7bf687a0 x16: ffff0000082f46d0 
[597369.025451] x15: 0000d0dd25e9e532 x14: 0029801fcab7a304 
[597369.030870] x13: 000000005e33ec9f x12: 0000000000000018 
[597369.036286] x11: 0000000005ea2fbc x10: 0000000000001950 
[597369.041695] x9 : ffff803e5faebd80 x8 : ffff803ee71539b0 
[597369.047102] x7 : 0000000000000002 x6 : ffff803e22ae94d0 
[597369.052513] x5 : ffff803c40d30258 x4 : ffff000009337000 
[597369.057922] x3 : 0000000000000001 x2 : ffff000018be201c 
[597369.063329] x1 : 0000000000000000 x0 : ffff000009339710 
[597369.068739] Process kworker/3:1H (pid: 480, stack limit = 0xffff803e5fae8000)
[597369.075968] Call trace:
[597369.078518]  nvme_timeout+0x48/0x340 [nvme]
[597369.082799]  blk_mq_check_expired+0x140/0x178
[597369.087252]  bt_for_each+0x118/0x140
[597369.090923]  blk_mq_queue_tag_busy_iter+0xa8/0x140
[597369.095808]  blk_mq_timeout_work+0x58/0x118
[597369.100087]  process_one_work+0x1e4/0x430
[597369.104202]  worker_thread+0x50/0x478
[597369.107963]  kthread+0x134/0x138
[597369.111286]  ret_from_fork+0x10/0x20
[597369.114958] Code: f94012f6 f94006d5 f94092a2 91007042 (b9400053) 
[597369.121172] ---[ end trace 0dc43e02d03f7b76 ]---

So it could be a faulty NVMe causing the breakdown, though there is quite some time between the timeout and the stack trace, which also mentions "nvme_timeout". Next entry:

[599128.041657] Synchronous External Abort: synchronous external abort (0x96000210) at 0xffff000081000000
[599128.050998] Internal error: : 96000210 [#3] SMP
[599128.055626] Modules linked in: nf_log_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_TCPMSS nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_nat nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfnetlink_cthelper nfnetlink nfs lockd grace sunrpc fscache scsi_transport_iscsi af_packet tun ip_gre gre ip_tunnel openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_physdev br_netfilter bridge stp llc xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 vfat fat nicvf cavium_ptp nicpf joydev thunder_bgx mdio_thunder mdio_cavium thunderx_edac thunder_xcv
[599128.126717]  cavium_rng_vf cavium_rng uio_pdrv_genirq aes_ce_blk crypto_simd uio cryptd aes_ce_cipher crc32_ce crct10dif_ce ghash_ce aes_arm64 ipmi_ssif ipmi_devintf ipmi_msghandler sha2_ce sha256_arm64 sha1_ce btrfs xor zstd_decompress zstd_compress xxhash zlib_deflate raid6_pq hid_generic usbhid ast xhci_pci i2c_algo_bit xhci_hcd drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm nvme nvme_core gpio_keys drm_panel_orientation_quirks usbcore thunderx_mmc i2c_thunderx i2c_smbus mmc_core dm_mirror dm_region_hash dm_log sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[599128.180483] CPU: 44 PID: 20557 Comm: lspci Tainted: G      D          4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1
[599128.191358] Hardware name: GIGABYTE R120-T32/MT30-GS1, BIOS T19 09/29/2016
[599128.198327] task: ffff803c3c186100 task.stack: ffff803c420e0000
[599128.204340] pstate: 20000085 (nzCv daIf -PAN -UAO)
[599128.209228] pc : pci_generic_config_read+0x5c/0xf0
[599128.214112] lr : pci_generic_config_read+0x48/0xf0
[599128.218995] sp : ffff803c420e3c00
[599128.222402] x29: ffff803c420e3c00 x28: 0000000000000000 
[599128.227810] x27: ffff803c3c1c1b00 x26: 0000000000000004 
[599128.233217] x25: ffff803c3c1c1b00 x24: 000000000000000f 
[599128.238624] x23: 0000000000000000 x22: 0000000000000000 
[599128.244031] x21: ffff803c420e3cbc x20: 0000000000000004 
[599128.249438] x19: ffff803ee7543800 x18: 0000fffffb2ad007 
[599128.254846] x17: 0000ffff9ea7c970 x16: ffff0000082dae98 
[599128.260265] x15: 000000000000000a x14: 0000000000000000 
[599128.265675] x13: 0000000000000000 x12: 0000000000000020 
[599128.271083] x11: 0000fffffb2ad008 x10: 0000000000000000 
[599128.276490] x9 : 0000ffff9ea2b5d8 x8 : 0000000000000043 
[599128.281897] x7 : 000000000000008f x6 : 00000000000000c7 
[599128.287304] x5 : 0000000000000010 x4 : 000000000000008f 
[599128.292711] x3 : ffff000080000000 x2 : 0000000000000090 
[599128.298118] x1 : 0000000001000000 x0 : ffff000081000000 
[599128.303528] Process lspci (pid: 20557, stack limit = 0xffff803c420e0000)
[599128.310322] Call trace:
[599128.312866]  pci_generic_config_read+0x5c/0xf0
[599128.317408]  thunder_pem_config_read+0x78/0x268
[599128.322032]  pci_user_read_config_dword+0x70/0x108
[599128.326918]  pci_read_config+0xdc/0x228
[599128.330848]  sysfs_kf_bin_read+0x6c/0xa8
[599128.334865]  kernfs_fop_read+0xa8/0x208
[599128.338796]  __vfs_read+0x48/0x140
[599128.342292]  vfs_read+0x94/0x150
[599128.345614]  SyS_pread64+0x8c/0xa8
[599128.349111]  el0_svc_naked+0x44/0x48
[599128.352781] Code: 7100069f 54000180 71000a9f 54000280 (b9400001) 
[599128.359002] ---[ end trace 0dc43e02d03f7b78 ]---

Another one after that, some time later:

[602781.468092] INFO: rcu_sched detected stalls on CPUs/tasks:
[602781.473695]         2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=2690 
[602781.482486]         (detected by 14, t=6003 jiffies, g=6715213, c=6715212, q=40961)
[602781.489635] Task dump for CPU 2:
[602781.492956] lspci           R  running task        0 26073   7344 0x00000002
[602781.500102] Call trace:
[602781.502649]  __switch_to+0xe4/0x150
[602781.506231]  0xffffffffffffffff
[602961.514252] INFO: rcu_sched detected stalls on CPUs/tasks:
[602961.519855]         2-...: (1 GPs behind) idle=fbe/140000000000000/0 softirq=6509573/6509574 fqs=10747 
[602961.528733]         (detected by 3, t=24008 jiffies, g=6715213, c=6715212, q=163507)
[602961.535968] Task dump for CPU 2:
[602961.539289] lspci           R  running task        0 26073   7344 0x00000002
[602961.546435] Call trace:
[602961.548981]  __switch_to+0xe4/0x150
[602961.552564]  0xffffffffffffffff

which then repeats. I reported the specific issue as https://bugzilla.opensuse.org/show_bug.cgi?id=1162612 and power cycled the machine.

Calling /sbin/lspci -vmm manually did not show any problem. The command succeeds within a second and seems to be fine.

#29 Updated by okurz 6 days ago

  • Priority changed from High to Low

No reaction on the bug yet. Reducing the priority to "Low" as our workarounds are good enough. We have other tickets for the reduced number of worker instances as well. Lately at least openqaworker-arm-3 had trouble coming up after reboot due to my work on #19238, so let's give it more time to gather unpolluted experience.
