action #88900

openqaworker13 was unreachable

Added by Xiaojing_liu 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2021-02-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

openqaworker13 was unreachable on 2021-02-22.

Acceptance criteria

  • AC1: machine is reboot-reliable, e.g. ensured by multiple reboots without failures after boot

Problem

The machine does not respond to ping.

Connecting to the machine with ipmitool sol activate shows:

[FAILED] Failed to start save kernel crash dump.
See 'systemctl status kdump-save.service' for details.
         Starting Reload Configuration from the Real Root...
[  OK  ] Started Reload Configuration from the Real Root.
[  OK  ] Reached target Initrd File Systems.
[  OK  ] Reached target Initrd Default Target.
         Starting dracut pre-pivot and cleanup hook...
[  OK  ] Started dracut pre-pivot and cleanup hook.
         Starting Cleaning Up and Shutting Down Daemons...
[  OK  ] Stopped target Timers.
[  OK  ] Stopped dracut pre-pivot and cleanup hook.
[  OK  ] Stopped target Initrd Default Target.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Slices.
[  OK  ] Stopped target Initrd Root Device.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
         Unmounting /kdump/mnt1/var/crash...
         Unmounting /kdump/mnt0...
         Stopping udev Kernel Device Manager...
[  OK  ] Stopped target Sockets.
[  OK  ] Stopped Load Kernel Modules.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Swap.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Remote File Systems (Pre).
[  OK  ] Stopped dracut initqueue hook.
[  OK  ] Stopped udev Coldplug all Devices.
[  OK  ] Stopped udev Kernel Device Manager.
[  OK  ] Unmounted /kdump/mnt1/var/crash.
[  OK  ] Started Cleaning Up and Shutting Down Daemons.
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create list of required sta…vice nodes for the current kernel.
[  OK  ] Stopped dracut pre-udev hook.
[  OK  ] Stopped dracut cmdline hook.
[  OK  ] Stopped dracut ask for additional cmdline parameters.
[  OK  ] Closed udev Control Socket.
[  OK  ] Closed udev Kernel Socket.
         Starting Cleanup udevd DB...
[  OK  ] Started Cleanup udevd DB.
[  OK  ] Reached target Switch Root.
         Starting Switch Root...
[FAILED] Failed to start Switch Root.
See 'systemctl status initrd-switch-root.service' for details.
         Starting Setup Virtual Console...
[  OK  ] Unmounted /kdump/mnt0.
[  OK  ] Stopped File System Check on /dev/d…3d875-0e24-49a9-8dc8-ddf8cf83f04e.
[  OK  ] Started Setup Virtual Console.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.

Generating "/run/initramfs/rdsosreport.txt"


Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.


Give root password for maintenance
(or press Control-D to continue):
[716615.675131] BTRFS info (device sda2): disk space caching is enabled
[716615.675134] BTRFS info (device sda2): has skinny extents
[716614.923484] dracut-cmdline[499]: mv: cannot stat '/etc/multipath.conf.kdump': No such file or directory
Unable to ioctl(KDSETLED) -- are you not on the console? (Inappropriate ioctl for device)
Extracting dmesg
-------------------------------------------------------------------------------

The dmesg log is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/dmesg.txt.
makedumpfile Completed.
-------------------------------------------------------------------------------
Saving dump using makedumpfile
-------------------------------------------------------------------------------
Copying data                                      : [100.0 %] /           eta: 0s

The dumpfile is saved to /kdump/mnt1/var/crash/2021-02-22-04:16/vmcore.

makedumpfile Completed.
-------------------------------------------------------------------------------

Then I power-cycled the machine with ipmitool power cycle. The workers on that host work well for now.
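The console output above reports two failed units before the machine drops into the emergency shell. As a minimal sketch (not part of the original investigation), the following Python snippet extracts such `[FAILED]` entries from a captured serial-over-LAN log. The function name `failed_units` and the sample text are illustrative assumptions; the sample lines are copied from the output above.

```python
import re

# Matches systemd console status lines such as
# "[FAILED] Failed to start Switch Root."
FAILED_LINE = re.compile(r"^\[FAILED\]\s+Failed to (?:start|mount)\s+(.*?)\.?\s*$")


def failed_units(console_log: str) -> list:
    """Return the descriptions of all units reported as [FAILED], in order."""
    found = []
    for line in console_log.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


# Lines taken from the console output in this ticket.
sample = """\
[FAILED] Failed to start save kernel crash dump.
See 'systemctl status kdump-save.service' for details.
[  OK  ] Reached target Switch Root.
         Starting Switch Root...
[FAILED] Failed to start Switch Root.
"""

print(failed_units(sample))  # ['save kernel crash dump', 'Switch Root']
```

The list only names what systemd reported as failed; it does not explain why the switch root step failed.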

Suggestions

  • Remove from salt
  • Identify problem
  • Ensure stability with multiple reboots
  • Add back to salt after stable
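The "multiple reboots" check from AC1 could be automated roughly as follows. This is a sketch under assumptions, not the procedure actually used in this ticket: `count_successful_boots`, `host_answers_ping`, and all parameter names and default values are hypothetical, and the reboot trigger is passed in as a callable so that any mechanism (for example an ipmitool power cycle) can be plugged in.

```python
import subprocess
import time
from typing import Callable


def host_answers_ping(host: str) -> bool:
    """Send a single ICMP echo request; True if the host replied."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def count_successful_boots(
    trigger_reboot: Callable[[], None],
    is_up: Callable[[], bool],
    reboots: int = 5,
    settle_s: float = 60.0,
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
) -> int:
    """Reboot `reboots` times and count how often the host came back in time."""
    successes = 0
    for _ in range(reboots):
        trigger_reboot()
        # Assumption: give the host time to actually go down, so that a
        # reply from the still-running system is not counted as a boot.
        time.sleep(settle_s)
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_up():
                successes += 1
                break
            time.sleep(poll_s)
    return successes
```

A call such as `count_successful_boots(trigger, lambda: host_answers_ping("openqaworker13"))` would return the number of reboots after which the host answered ping again, where `trigger` is whatever callable performs the power cycle. Answering ping does not prove that all services started, so a stricter `is_up` check may be preferable.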

Workaround

power cycle over IPMI


Related issues

Related to openQA Infrastructure - action #89497: flaky Failed systemd services alert (except openqa.suse.de), Resolved, 2021-03-04, 2021-03-31

Related to openQA Infrastructure - action #89551: NFS mount fails after boot (reproducible on some OSD workers), Resolved, 2021-03-05, 2021-03-31

History

#1 Updated by okurz 5 months ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by mkittler 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#3 Updated by mkittler 5 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (mkittler)

From the logs it is hard to tell what's wrong. I suppose one would really have to reboot the machine a few times to find out what's wrong (and remove it from salt beforehand). I'm not doing that now and hence unassign myself again.

#4 Updated by mkittler 5 months ago

  • Assignee set to mkittler

#5 Updated by mkittler 5 months ago

I've power-cycled and regularly restarted the machine. The only problem I've found so far is that the NFS mount doesn't work. It correctly depends on network-online.target:

systemctl list-dependencies var-lib-openqa-share.mount 
var-lib-openqa-share.mount
● ├─-.mount
● ├─system.slice
● ├─var-lib-openqa.mount
● └─network-online.target
●   └─wicked.service

However, reaching that target does not mean the eth network connection is already up:

[  OK  ] Started Xinetd A Powerful Replacement For Inetd.
[  OK  ] Started The plugin-driven server ag…r reporting metrics into InfluxDB.
         Starting os-autoinst openvswitch helper...
         Starting NTP Server Daemon...
         Starting The Salt Minion...
         Starting Daemon to remotely execute Nagios plugins...
[  OK  ] Reached target Network is Online.
         Mounting /var/lib/openqa/share...
         Starting Login and scanning of iSCSI devices...
         Starting Reboot Manager...
         Starting Load kdump kernel and initrd...
[FAILED] Failed to mount /var/lib/openqa/share.
See 'systemctl status var-lib-openqa-share.mount' for details.
[DEPEND] Dependency failed for Remote File Systems.
[  OK  ] Started Login and scanning of iSCSI devices.
         Starting Permit User Sessions...
[  OK  ] Started OpenQA Worker Cache Service.
[  OK  ] Started OpenQA Worker Cache Service Minion.
[  OK  ] Started Daemon to remotely execute Nagios plugins.
[  OK  ] Started Permit User Sessions.
[  OK  ] Started Reboot Manager.
[  OK  ] Started OpenSSH Daemon.
[  OK  ] Started NTP Server Daemon.
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Started daily update of the root trust anchor for DNSSEC.
[  OK  ] Started Timeline of Snapper Snapshots.
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
         Starting Postfix Mail Transport Agent...
[  OK  ] Started Backup of /etc/sysconfig.
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Started Backup of RPM database.
[  OK  ] Started Discard unused blocks on a mounted filesystem.
[  OK  ] Started Nightly trigger of auto-update..
[  OK  ] Started Daily rotation of log files.
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Discard unused blocks once a week.
[  OK  ] Reached target Timers.
[  OK  ] Started Load kdump kernel and initrd.
[  OK  ] Started tgt admin.
[  OK  ] Started Update cron periods from /etc/sysconfig/btrfsmaintenance.
[  OK  ] Started Postfix Mail Transport Agent.
[  OK  ] Started Command Scheduler.
[  OK  ] Started The Salt Minion.
[   87.630281] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   87.640177] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  OK  ] Stopped OpenQA Worker Cache Service Minion.
[  OK  ] Started OpenQA Worker Cache Service Minion.
[  OK  ] Started os-autoinst openvswitch helper.
[  118.440908] NET: Registered protocol family 17
[  OK  ] Started Timeline of Snapper Snapshots.
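The ordering problem in the boot log above can be checked mechanically. The following Python sketch (an illustration only; `first_index` and `mount_attempted_before_link_up` are hypothetical names) reports whether the mount attempt appears in the console output before the kernel's "NIC Link is Up" message. Since the systemd status lines carry no timestamps, line order is used as a proxy for time order, which is an assumption.

```python
def first_index(lines: list, needle: str) -> int:
    """Index of the first line containing needle, or -1 if absent."""
    for i, line in enumerate(lines):
        if needle in line:
            return i
    return -1


def mount_attempted_before_link_up(console_log: str) -> bool:
    """True if the NFS mount was started before the NIC link came up."""
    lines = console_log.splitlines()
    mount_idx = first_index(lines, "Mounting /var/lib/openqa/share")
    link_idx = first_index(lines, "NIC Link is Up")
    if mount_idx == -1:
        return False  # no mount attempt in this log at all
    # A missing link message means the link never came up in this excerpt.
    return link_idx == -1 or mount_idx < link_idx


# Lines taken from the boot log in this comment.
sample = """\
[  OK  ] Reached target Network is Online.
         Mounting /var/lib/openqa/share...
[FAILED] Failed to mount /var/lib/openqa/share.
[   87.630281] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
"""

print(mount_attempted_before_link_up(sample))  # True
```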

#6 Updated by mkittler 5 months ago

After waiting long enough, the problem fixes itself:

[  OK  ] Reached target Network.
         Starting The Salt Minion...
[  OK  ] Started The plugin-driven server ag…r reporting metrics into InfluxDB.
[  OK  ] Started Xinetd A Powerful Replacement For Inetd.
         Starting os-autoinst openvswitch helper...
         Starting tgt admin...
         Starting Daemon to remotely execute Nagios plugins...
         Starting OpenSSH Daemon...
         Starting NTP Server Daemon...
[  OK  ] Reached target Network is Online.
         Starting Login and scanning of iSCSI devices...
         Mounting /var/lib/openqa/share...
         Starting Reboot Manager...
[  OK  ] Started Daemon to remotely execute Nagios plugins.
[  OK  ] Started Login and scanning of iSCSI devices.
[FAILED] Failed to mount /var/lib/openqa/share.
See 'systemctl status var-lib-openqa-share.mount' for details.
[DEPEND] Dependency failed for Remote File Systems.
[  OK  ] Started Reboot Manager.
[  OK  ] Started OpenQA Worker Cache Service.
[  OK  ] Started OpenQA Worker Cache Service Minion.
         Starting Permit User Sessions...
[  OK  ] Started OpenSSH Daemon.
[  OK  ] Started Permit User Sessions.
[  OK  ] Started NTP Server Daemon.
[  OK  ] Reached target System Time Synchronized.
[  OK  ] Started Backup of /etc/sysconfig.
[  OK  ] Started Timeline of Snapper Snapshots.
[  OK  ] Started Discard unused blocks once a week.
[  OK  ] Started Nightly trigger of auto-update..
[  OK  ] Started Balance block groups on a btrfs filesystem.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Started Discard unused blocks on a mounted filesystem.
[  OK  ] Started daily update of the root trust anchor for DNSSEC.
[  OK  ] Started Backup of RPM database.
         Starting Postfix Mail Transport Agent...
[  OK  ] Started Scrub btrfs filesystem, verify block checksums.
[  OK  ] Started Check if mainboard battery is Ok.
[  OK  ] Reached target Timers.
[  OK  ] Started Load kdump kernel and initrd.
[  OK  ] Started tgt admin.
[   81.850074] device tap0 entered promiscuous mode
[   81.891042] device tap1 entered promiscuous mode
[   81.923938] device tap10 entered promiscuous mode
[   82.001237] device tap11 entered promiscuous mode
[   82.032244] device tap12 entered promiscuous mode
[   82.068069] device tap128 entered promiscuous mode
[   82.101483] device tap129 entered promiscuous mode
[   82.158279] device tap13 entered promiscuous mode
[   82.224777] device tap130 entered promiscuous mode
[  OK  ] Started Update cron periods from /etc/sysconfig/btrfsmaintenance.
[   82.256890] device tap131 entered promiscuous mode
[   82.289120] device tap132 entered promiscuous mode
[   82.321944] device tap133 entered promiscuous mode
[   82.420014] device tap134 entered promiscuous mode
[   82.453938] device tap135 entered promiscuous mode
[   82.487335] device tap136 entered promiscuous mode
[   82.522267] device tap137 entered promiscuous mode
[   82.593083] device tap138 entered promiscuous mode
[   82.630759] device tap139 entered promiscuous mode
[   82.671474] device tap14 entered promiscuous mode
[   82.703423] device tap140 entered promiscuous mode
[   82.743970] device tap141 entered promiscuous mode
[   82.785203] device tap142 entered promiscuous mode
[   82.834132] device tap143 entered promiscuous mode
[   82.875842] device tap15 entered promiscuous mode
[   82.916336] device tap2 entered promiscuous mode
[   82.956525] device tap3 entered promiscuous mode
[   83.035539] device tap4 entered promiscuous mode
[   83.067627] device tap5 entered promiscuous mode
[   83.105731] device tap6 entered promiscuous mode
[   83.136749] device tap64 entered promiscuous mode
[   83.224059] device tap65 entered promiscuous mode
[  OK  ] Started Postfix Mail Transport Agent.
[  OK  ] Started Command Scheduler.
[   83.258271] device tap66 entered promiscuous mode
[   83.304535] device tap67 entered promiscuous mode
[   83.336878] device tap68 entered promiscuous mode
[   83.397169] device tap69 entered promiscuous mode
[   83.440219] device tap7 entered promiscuous mode
[   83.481695] device tap70 entered promiscuous mode
[   83.517879] device tap71 entered promiscuous mode
[   83.567131] device tap72 entered promiscuous mode
[   83.622994] device tap73 entered promiscuous mode
[   83.662240] device tap74 entered promiscuous mode
[   83.692019] device tap75 entered promiscuous mode
[   83.728324] device tap76 entered promiscuous mode
[   83.767797] device tap77 entered promiscuous mode
[   83.851449] device tap78 entered promiscuous mode
[   83.886525] device tap79 entered promiscuous mode
[   83.924902] device tap8 entered promiscuous mode
[   83.992720] device tap9 entered promiscuous mode
[  OK  ] Started The Salt Minion.
[   84.810484] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   84.820319] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  OK  ] Stopped OpenQA Worker Cache Service Minion.
[  OK  ] Started OpenQA Worker Cache Service Minion.
[  OK  ] Started os-autoinst openvswitch helper.
[  116.389048] NET: Registered protocol family 17
         Mounting /var/lib/openqa/share...
[  342.696380] FS-Cache: Loaded
[  342.798608] RPC: Registered named UNIX socket transport module.
[  342.805011] RPC: Registered udp transport module.
[  342.810138] RPC: Registered tcp transport module.
[  342.815201] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  342.914501] FS-Cache: Netfs 'nfs' registered for caching
[  342.977172] Key type dns_resolver registered
[  343.257046] NFS: Registering the id_resolver key type
[  343.262732] Key type id_resolver registered
[  343.267860] Key type id_legacy registered
[  OK  ] Listening on RPCbind Server Activation Socket.
         Starting Notify NFS peers of a restart...
[  OK  ] Reached target RPC Port Mapper.
         Starting NFS status monitor for NFSv2/3 locking....
[  OK  ] Started Notify NFS peers of a restart.
         Starting RPC Bind...
[  OK  ] Started RPC Bind.
[  OK  ] Started NFS status monitor for NFSv2/3 locking..
[  OK  ] Mounted /var/lib/openqa/share.
[  OK  ] Started /etc/init.d/boot.local Compatibility.
[  OK  ] Started Serial Getty on ttyS1.
[  OK  ] Started Getty on tty1.
sudo journalctl -fu var-lib-openqa-share.mount
-- Logs begin at Thu 2018-10-04 14:09:46 CEST. --
Mär 04 14:32:57 openqaworker13 systemd[1]: Mounting /var/lib/openqa/share...
Mär 04 14:32:59 openqaworker13 systemd[1]: Mounted /var/lib/openqa/share.
Mär 04 14:38:32 openqaworker13 systemd[1]: Unmounting /var/lib/openqa/share...
Mär 04 14:38:33 openqaworker13 systemd[1]: Unmounted /var/lib/openqa/share.
-- Reboot --
Mär 04 14:43:14 openqaworker13 systemd[1]: Mounting /var/lib/openqa/share...
Mär 04 14:43:14 openqaworker13 systemd[1]: var-lib-openqa-share.mount: Mount process exited, code=exited status=32
Mär 04 14:43:14 openqaworker13 systemd[1]: Failed to mount /var/lib/openqa/share.
Mär 04 14:43:14 openqaworker13 systemd[1]: var-lib-openqa-share.mount: Unit entered failed state.
Mär 04 14:47:36 openqaworker13 systemd[1]: Mounting /var/lib/openqa/share...
Mär 04 14:47:38 openqaworker13 systemd[1]: Mounted /var/lib/openqa/share.
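To quantify how long the share stayed unavailable, the gap between the failed mount and the later successful one can be computed from the journal lines above. This is a small illustrative sketch; `seconds_between` and `retry_delay` are hypothetical helpers, and the code assumes both events happen on the same day and that the time of day is the third whitespace-separated field, as in the excerpt.

```python
from datetime import datetime
from typing import Optional


def seconds_between(start_hms: str, end_hms: str) -> int:
    """Difference in seconds between two HH:MM:SS times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end_hms, fmt) - datetime.strptime(start_hms, fmt)
    return int(delta.total_seconds())


def retry_delay(journal: str) -> Optional[int]:
    """Seconds from the first failed mount to the next successful mount."""
    failed_at = None
    for line in journal.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        timestamp = fields[2]  # e.g. "14:43:14"
        if "Failed to mount /var/lib/openqa/share" in line:
            if failed_at is None:
                failed_at = timestamp
        elif "Mounted /var/lib/openqa/share" in line and failed_at is not None:
            return seconds_between(failed_at, timestamp)
    return None


# Lines taken from the journal output in this comment.
journal_excerpt = """\
Mär 04 14:43:14 openqaworker13 systemd[1]: Failed to mount /var/lib/openqa/share.
Mär 04 14:47:36 openqaworker13 systemd[1]: Mounting /var/lib/openqa/share...
Mär 04 14:47:38 openqaworker13 systemd[1]: Mounted /var/lib/openqa/share.
"""

print(retry_delay(journal_excerpt))  # 264
```

For this boot the share became available 264 seconds (4 minutes 24 seconds) after the failed attempt.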

Not sure whether the retry can be configured (e.g. to make it more frequent).
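Independent of whether systemd itself can be tuned here, a more frequent retry could be approximated from the outside. The sketch below is an assumption-laden workaround idea, not a systemd feature and not something applied in this ticket: `retry` and `try_start_mount_unit` are hypothetical helpers, the attempt count and delay are arbitrary, and starting the unit requires root privileges.

```python
import subprocess
import time
from typing import Callable


def try_start_mount_unit(unit: str = "var-lib-openqa-share.mount") -> bool:
    """Ask systemd to start the mount unit; True if systemctl exits with 0."""
    return subprocess.run(["systemctl", "start", unit]).returncode == 0


def retry(action: Callable[[], bool], attempts: int = 10, delay_s: float = 15.0) -> int:
    """Call action until it returns True.

    Returns the number of the attempt that succeeded, or 0 if all failed.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # fixed pause between attempts
    return 0
```

Calling `retry(try_start_mount_unit)` would try to start the unit up to ten times, 15 seconds apart. Passing the action as a callable keeps the retry logic testable without touching a real system.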

#7 Updated by cdywan 5 months ago

  • Related to action #89497: flaky Failed systemd services alert (except openqa.suse.de) added

#8 Updated by mkittler 5 months ago

I did a power cycle, a power reset and some more regular reboots. The mentioned NFS problem was always reproducible, but the system booted without problems. So whatever caused the initial problem is no longer apparent, and we can likely close the ticket.

Not sure about the NFS problem. My reboots and the NFS problem together caused the monitoring alerts (so the alerts can be ignored).

By the way, since the worker was offline for a while, it had not had my recent manual setup changes applied. I took the opportunity to do that now. It should now be on par with the other workers.

#9 Updated by mkittler 5 months ago

  • Status changed from Workable to Feedback

#10 Updated by okurz 5 months ago

I suggest we still handle the NFS issue. If you prefer, make that a separate ticket, though.

#11 Updated by mkittler 5 months ago

  • Status changed from Feedback to Resolved

I created a separate ticket for the NFS problem: #89551

#12 Updated by okurz 5 months ago

  • Related to action #89551: NFS mount fails after boot (reproducible on some OSD workers) added
