action #88474

All workers on powerqaworker-qam-1 are offline

Added by Xiaojing_liu 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-02-08
Due date:
% Done: 0%
Estimated time:
Description

All workers on powerqaworker-qam-1 are offline on OSD, as shown at https://openqa.suse.de/admin/workers

The host does not respond to ping, but ipmitool works fine.

I checked the network; the output is:

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enP1p3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq qlen 1000
    link/ether 6c:ae:8b:69:21:74 brd ff:ff:ff:ff:ff:ff
3: enP1p3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:21:75 brd ff:ff:ff:ff:ff:ff
4: enP1p3s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:21:76 brd ff:ff:ff:ff:ff:ff
5: enP1p3s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:21:77 brd ff:ff:ff:ff:ff:ff
6: enP3p9s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:20:20 brd ff:ff:ff:ff:ff:ff
7: enP3p9s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:20:21 brd ff:ff:ff:ff:ff:ff
8: enP3p9s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:20:22 brd ff:ff:ff:ff:ff:ff
9: enP3p9s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 6c:ae:8b:69:20:23 brd ff:ff:ff:ff:ff:ff
10: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
/ # ethtool enP1p3s0f0
Settings for enP1p3s0f0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x000000ff (255)
                               drv probe link timer ifdown ifup rx_err tx_err
        Link detected: no
/ # SOL session closed by BMC

Is the cable unplugged?
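
As a way to read that off the console output quickly, here is a minimal sketch that lists interfaces reporting NO-CARRIER. It assumes the `ip a` output format shown above; `no_carrier_ifaces` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: print the names of interfaces whose flags include
# NO-CARRIER. Reads `ip a` / `ip link` style output on stdin.
no_carrier_ifaces() {
    awk '/^[0-9]+: / && /NO-CARRIER/ {
        name = $2
        sub(/:$/, "", name)    # drop the trailing colon after the name
        sub(/@.*$/, "", name)  # drop an "@parent" suffix such as tunl0@NONE
        print name
    }'
}
```

It would be used as `ip a | no_carrier_ifaces`. For the output above only enP1p3s0f0 carries that flag, which matches the "Link detected: no" line from ethtool.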


Related issues

Related to openQA Infrastructure - action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now | Resolved | 2020-12-15 | 2021-04-16

Related to openQA Infrastructure - action #88225: osd infrastructure: Many failed systemd services on various machines | Resolved | 2021-01-26

History

#1 Updated by cdywan 8 months ago

  • Status changed from New to Workable

I can confirm the workers are Offline, and ssh powerqaworker-qam-1.qa.suse.de gets stuck.

#2 Updated by cdywan 8 months ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

In petitboot I see no devices, probably "the usual", so I'll attempt a manual reboot.

Errors observed via dmesg for the record:

[   25.235395] Btrfs loaded
[   25.235857] BTRFS: device fsid e29496d5-0080-4a01-9bde-b786944f4ba4 devid 2 transid 2171370 /dev/sda2
[   25.237964] BTRFS info (device sda2): disk space caching is enabled
[   25.237966] BTRFS: has skinny extents
[   25.238767] BTRFS: failed to read the system array on sda2

#3 Updated by cdywan 8 months ago

  • Related to action #81058: [tracker-ticket] Power machines can't find installed OS. Automatic reboots disabled for now added

#4 Updated by cdywan 8 months ago

trying parsers for sdb2
parse error: 237('{'): syntax error, unexpected '{', expecting elif or else or fi

Confirmed "the usual" via cat /var/log/petitboot/pb-discover.log. Rebooted via kexec -l /var/petitboot/mnt/dev/sda2/boot/vmlinux-5.3.18-lp152.57-default --initrd=/var/petitboot/mnt/dev/sda2/boot/initrd-5.3.18-lp152.57-default --command-line="root=UUID=e29496d5-0080-4a01-9bde-b786944f4ba4 nospec kvm.nested=1 kvm_in tel.nested=1 kvm_amd.nested=1 kvm-arm.nested=1 crashkernel=210M" && kexec -e.
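
For the record, a minimal sketch of how that kexec invocation could be assembled from its parts and printed for review before running it. `kexec_load_cmd` is a hypothetical helper name; the mount point, kernel version and command line are whatever the caller passes in.

```shell
# Hypothetical helper: print (do not run) a kexec load command for a kernel
# found under a petitboot mount point.
#   $1 = mounted root filesystem, e.g. /var/petitboot/mnt/dev/sda2
#   $2 = kernel version suffix, e.g. 5.3.18-lp152.57-default
#   $3 = kernel command line, e.g. "root=UUID=... crashkernel=210M"
kexec_load_cmd() {
    printf 'kexec -l %s/boot/vmlinux-%s --initrd=%s/boot/initrd-%s --command-line="%s"\n' \
        "$1" "$2" "$1" "$2" "$3"
}
```

After checking the printed line it can be run as is, followed by `kexec -e`. Printing first makes stray whitespace in the command line easier to spot.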

#5 Updated by cdywan 8 months ago

  • Related to action #88225: osd infrastructure: Many failed systemd services on various machines added

#6 Updated by cdywan 8 months ago

  • Status changed from In Progress to Feedback

systemctl list-units --failed
● logrotate.service loaded failed failed Rotate log files
systemctl status logrotate
● logrotate.service - Rotate log files
   Loaded: loaded (/usr/lib/systemd/system/logrotate.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-02-08 11:40:38 CET; 9min ago
     Docs: man:logrotate(8)
           man:logrotate.conf(5)
 Main PID: 9447 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

This seems to be the issue from #88225#note-2; running chown openvswitch:openvswitch /var/log/openvswitch/ && systemctl restart logrotate fixed it.

I'm not going to reboot for now and am ignoring the extra space in the kexec command line above (kvm_in tel.nested=1), assuming those parameters are irrelevant on non-Intel machines.
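
A minimal sketch of checking the directory ownership before applying that fix, assuming GNU stat is available on the worker; `owner_mismatch` is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 when the owner:group of a path differs from
# the expected value, exit 1 when it matches, exit 2 when stat fails.
# Relies on the -c format option of GNU stat.
owner_mismatch() {
    dir=$1
    expected=$2
    actual=$(stat -c '%U:%G' "$dir") || return 2
    [ "$actual" != "$expected" ]
}

# Possible use, mirroring the fix above:
#   if owner_mismatch /var/log/openvswitch openvswitch:openvswitch; then
#       chown openvswitch:openvswitch /var/log/openvswitch/ && systemctl restart logrotate
#   fi
```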

Workers look to be coming back up.

#7 Updated by okurz 8 months ago

  • Target version set to Ready

#8 Updated by cdywan 8 months ago

  • Status changed from Feedback to Resolved

Workers seem good and I confirmed that they are processing jobs, so setting this to Resolved. The long-term issues causing this are reflected in the two related tickets.
