action #88474
closed
All workers on powerqaworker-qam-1 are offline
Description
All workers on powerqaworker-qam-1 are offline on OSD, as shown at https://openqa.suse.de/admin/workers
The machine does not respond to ping, but ipmitool works well.
I checked the network, and the result shows:
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enP1p3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq qlen 1000
link/ether 6c:ae:8b:69:21:74 brd ff:ff:ff:ff:ff:ff
3: enP1p3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:21:75 brd ff:ff:ff:ff:ff:ff
4: enP1p3s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:21:76 brd ff:ff:ff:ff:ff:ff
5: enP1p3s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:21:77 brd ff:ff:ff:ff:ff:ff
6: enP3p9s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:20:20 brd ff:ff:ff:ff:ff:ff
7: enP3p9s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:20:21 brd ff:ff:ff:ff:ff:ff
8: enP3p9s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:20:22 brd ff:ff:ff:ff:ff:ff
9: enP3p9s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
link/ether 6c:ae:8b:69:20:23 brd ff:ff:ff:ff:ff:ff
10: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
/ # ethtool enP1p3s0f0
Settings for enP1p3s0f0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Speed: Unknown!
Duplex: Unknown! (255)
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: Unknown
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
drv probe link timer ifdown ifup rx_err tx_err
Link detected: no
/ # SOL session closed by BMC
Is the cable unplugged?
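The link states in the `ip a` output above can be summarised with a small script. This is a minimal sketch, assuming the BusyBox-style header format shown in this ticket; the names `link_states` and `classify` are hypothetical helpers and not part of any existing tool.

```python
import re

# Header lines look like:
# "2: enP1p3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq qlen 1000"
HEADER = re.compile(r"^\s*\d+:\s+(?P<name>[^:\s]+):\s+<(?P<flags>[^>]*)>")


def link_states(ip_a_output):
    """Map each interface name to the set of flags from its header line."""
    states = {}
    for line in ip_a_output.splitlines():
        match = HEADER.match(line)
        if match:
            flags = match.group("flags")
            states[match.group("name")] = set(flags.split(",")) if flags else set()
    return states


def classify(flags):
    """Turn a flag set into a short human-readable link state."""
    if "UP" not in flags:
        return "admin down"
    if "NO-CARRIER" in flags:
        return "up, no carrier"
    if "LOWER_UP" in flags:
        return "up, link ok"
    return "up, link state unknown"


# Excerpt copied from the output in this ticket.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
2: enP1p3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq qlen 1000
3: enP1p3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
"""

if __name__ == "__main__":
    for name, flags in link_states(SAMPLE).items():
        print(f"{name}: {classify(flags)}")
```

An interface that is administratively up but reports NO-CARRIER, as enP1p3s0f0 does here, is consistent with the "Link detected: no" line from ethtool and points at the physical link rather than the configuration.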
Updated by livdywan almost 4 years ago
- Status changed from New to Workable
I can confirm the workers are offline, and ssh powerqaworker-qam-1.qa.suse.de gets stuck.
Updated by livdywan almost 4 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
In petitboot I see no devices, probably "the usual", so I'll attempt a manual reboot.
Errors observed via dmesg, for the record:
[ 25.235395] Btrfs loaded
[ 25.235857] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 2 transid 2171370 /dev/sda2
[ 25.237964] BTRFS info (device sda2): disk space caching is enabled
[ 25.237966] BTRFS: has skinny extents
[ 25.238767] BTRFS: failed to read the system array on sda2
Updated by livdywan almost 4 years ago
- Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added
Updated by livdywan almost 4 years ago
trying parsers for sdb2
parse error: 237('{'): syntax error, unexpected '{', expecting elif or else or fi
Confirmed "the usual" via cat /var/log/petitboot/pb-discover.log. Rebooted via:
kexec -l /var/petitboot/mnt/dev/sda2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sda2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_in tel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e
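Typing the kexec command line by hand is how a stray space can slip into a kernel parameter. A minimal sketch, assuming the parameters are kept as a list and joined by a script: `build_kexec_command` is a hypothetical helper, and the spelling `kvm_intel.nested=1` in the example is my assumption about what was intended, not something this ticket confirms.

```python
import shlex


def build_kexec_command(kernel, initrd, params):
    """Return a shell command string that loads and executes a kernel via kexec.

    Each kernel parameter must be a single token; whitespace inside a
    parameter would silently split it into two parameters at boot.
    """
    for param in params:
        if not param or any(ch.isspace() for ch in param):
            raise ValueError(f"kernel parameter contains whitespace: {param!r}")
    command_line = " ".join(params)
    load = [
        "kexec",
        "-l",
        kernel,
        f"--initrd={initrd}",
        f"--command-line={command_line}",
    ]
    # shlex.join quotes the arguments so the string is safe to paste into a shell.
    return shlex.join(load) + " && kexec -e"


if __name__ == "__main__":
    boot = "/var/petitboot/mnt/dev/sda2/boot"
    params = [
        "root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4",
        "nospec",
        "kvm.nested=1",
        "kvm_intel.nested=1",  # assumed intended spelling
        "kvm_amd.nested=1",
        "kvm-arm.nested=1",
        "crashkernel=210M",
    ]
    print(build_kexec_command(
        f"{boot}/vmlinux-5.3.18-lp152.57-default",
        f"{boot}/initrd-5.3.18-lp152.57-default",
        params,
    ))
```

The script only prints the command; it still has to be run by hand in the petitboot shell, where Python may not be available, so the string would be generated on another machine.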
Updated by livdywan almost 4 years ago
- Related to action #88225: osd infrastructure: Many failed systemd services on various machines added
Updated by livdywan almost 4 years ago
- Status changed from In Progress to Feedback
systemctl list-units --failed
● logrotate.service loaded failed failed Rotate log files
systemctl status logrotate
● logrotate.service - Rotate log files
Loaded: loaded (/usr/lib/systemd/system/logrotate.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2021-02-08 11:40:38 CET; 9min ago
Docs: man:logrotate(8)
man:logrotate.conf(5)
Main PID: 9447 (code=exited, status=1/FAILURE)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Seems to be #88225#note-2, and chown openvswitch:openvswitch /var/log/openvswitch/ && systemctl restart logrotate worked.
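Since #88225 covers failed services on many machines, the check above could be scripted. This is a rough sketch, assuming the column layout of the `systemctl list-units --failed` line shown in this ticket (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); `failed_units` is a hypothetical helper.

```python
def failed_units(listing):
    """Extract unit names whose ACTIVE column is 'failed' from systemctl output."""
    units = []
    for line in listing.splitlines():
        # Drop the leading status bullet, then split on whitespace:
        # UNIT LOAD ACTIVE SUB DESCRIPTION...
        fields = line.replace("●", " ").split()
        if len(fields) >= 4 and fields[2] == "failed":
            units.append(fields[0])
    return units


if __name__ == "__main__":
    # Line copied from the output in this ticket.
    sample = "● logrotate.service loaded failed failed Rotate log files"
    print(failed_units(sample))
```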
I'm not going to reboot for now. I'm ignoring the extra space in the kexec command line (kvm_in tel.nested=1), assuming those parameters are irrelevant on non-Intel hardware.
Workers look to be coming back up.
Updated by livdywan almost 4 years ago
- Status changed from Feedback to Resolved
Workers seem good and I confirmed that they are processing jobs, so setting this to Resolved. The long-term issues causing this are reflected in the two related tickets.