action #116848

Ensure kdump is enabled and working on all OSD machines

Added by nicksinger 2 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2022-09-20
Due date:
2022-10-25
% Done:

0%

Estimated time:

Description

Observation

With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/733 we introduce automatic reboots on all machines if the kernel panics. This could hide problems. Therefore we should make sure to enable kdump on all machines so we get reports when the kernel panics and can fix the underlying problem. Rebooting should be just a workaround, not the solution.

Acceptance Criteria


Related issues

Related to openQA Infrastructure - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Blocked, 2022-09-18 to 2023-01-20)

History

#1 Updated by nicksinger 2 months ago

  • Description updated (diff)

#2 Updated by nicksinger 2 months ago

  • Description updated (diff)

#3 Updated by okurz 2 months ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added

#4 Updated by okurz 2 months ago

  • Target version set to Ready

Fully agreeing. I already mentioned enabling kdump in #116722.

#5 Updated by okurz 2 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz

#6 Updated by okurz about 2 months ago

  • Status changed from Feedback to New
  • Assignee deleted (okurz)

MR merged. Testing out kdump by triggering a crash: on openqaworker12 we ran echo c > /proc/sysrq-trigger. On the IPMI SOL we could watch the crashkernel boot and write dump and log files to /kdump/mnt/var/crash/, so we should be able to find them later under /var/crash in the root file system. But then an error occurred:

INFO: Email failed: SMTP server not reachable

We could not find helpful messages in the logfile which would tell us where the problem lies. So email sending needs to be investigated.

Also, after reboot kdump services fail, e.g. on backup.qa.suse.de, as the crashkernel boot parameter is missing. Likely it was never enabled during installation. But we can move the corresponding part of our salt states that sets this boot parameter into the kdump state as well.
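A minimal sketch of what that salt-managed change boils down to on a host (the 210M size and the `GRUB_CMDLINE_LINUX_DEFAULT` variable are assumptions based on the logs later in this ticket; the real implementation lives in salt-states-openqa):

```shell
#!/bin/sh
# ensure_crashkernel FILE [SIZE] — add a crashkernel= parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file if it is missing.
ensure_crashkernel() {
    file="$1"
    size="${2:-210M}"
    # nothing to do if some crashkernel= setting is already present
    grep -q 'crashkernel=' "$file" && return 0
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&crashkernel=$size /" "$file"
}
```

After changing /etc/default/grub one would still need to regenerate the config (grub2-mkconfig -o /boot/grub2/grub.cfg) and reboot for the memory reservation to take effect.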

#7 Updated by mkittler about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Now related systemd units are failing:

martchus@openqa:~> sudo systemctl status kdump
× kdump.service - Load kdump kernel and initrd
     Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Sun 2022-10-09 03:32:57 CEST; 1 day 8h ago
    Process: 1184 ExecStart=/usr/lib/kdump/load.sh --update (code=exited, status=1/FAILURE)
   Main PID: 1184 (code=exited, status=1/FAILURE)

Okt 09 03:32:29 openqa systemd[1]: Starting Load kdump kernel and initrd...
Okt 09 03:32:57 openqa load.sh[1184]: Memory for crashkernel is not reserved
Okt 09 03:32:57 openqa load.sh[1184]: Please reserve memory by passing"crashkernel=X@Y" parameter to kernel
Okt 09 03:32:57 openqa load.sh[1184]: Then try to loading kdump kernel
Okt 09 03:32:57 openqa load.sh[1184]: Memory for crashkernel is not reserved
Okt 09 03:32:57 openqa load.sh[1184]: Please reserve memory by passing"crashkernel=X@Y" parameter to kernel
Okt 09 03:32:57 openqa load.sh[1184]: Then try to loading kdump kernel
Okt 09 03:32:57 openqa systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Okt 09 03:32:57 openqa systemd[1]: kdump.service: Failed with result 'exit-code'.
Okt 09 03:32:57 openqa systemd[1]: Failed to start Load kdump kernel and initrd.

I've been resetting those units for now.

#9 Updated by openqa_review about 2 months ago

  • Due date set to 2022-10-25

Setting due date based on mean cycle time of SUSE QE Tools

#10 Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

It still works on the workers as before with the updated /etc/default/grub and /boot/grub2/grub.cfg, e.g.:

Okt 11 09:44:41 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)

I cannot check hosts other than OSD right now as they're currently in maintenance, but /etc/default/grub and /boot/grub2/grub.cfg look good on OSD.

#11 Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

Also looks good on other hosts:

Okt 11 10:49:20 monitor kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.21-default root=UUID=19732107-332c-4bef-8433-158534a48955 resume=/dev/vda1 splash=silent quiet showopts crashkernel=210M
…
Okt 11 10:49:20 monitor kernel: Reserving 210MB of memory at 2848MB for crashkernel (System RAM: 16383MB)
martchus@openqa-monitor:~> systemctl status kdump
● kdump.service - Load kdump kernel and initrd
     Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
     Active: active (exited) since Tue 2022-10-11 10:51:13 CEST; 7min ago
Okt 11 10:49:26 backup-vm kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.21-default root=/dev/mapper/system-root resume=/dev/system/swap splash=silent quiet showopts crashkernel=210M
…
Okt 11 10:49:26 backup-vm kernel: Reserving 210MB of memory at 2848MB for crashkernel (System RAM: 4095MB)
martchus@backup-vm:~> systemctl status kdump
● kdump.service - Load kdump kernel and initrd
     Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
     Active: active (exited) since Tue 2022-10-11 10:51:38 CEST; 8min ago

Not sure whether we might need to introduce an override to configure different crashkernel sizes depending on the system RAM.

#12 Updated by okurz about 2 months ago

  • Status changed from Resolved to Feedback

Good. I suggest you rerun the experiment from #116848#note-6 on a non-worker host to see whether both the crashkernel is loaded correctly and kdump email sending works.

#13 Updated by mkittler about 2 months ago

Oh, I've overlooked AC1.2. I suppose sending an email to osd-admins would make sense. However, considering #116848#note-6 there's likely no sufficient network setup in the kdump kernel env.

#14 Updated by mkittler about 2 months ago

Maybe we need to change networking-related variables:

# Network configuration. Use "auto" for auto-detection in initrd, or a string
# that contains the network device and the mode (static, dhcp, dhcp6, auto6),
# separated by a colon. Example: "eth0:static" or "eth1:dhcp".
#
# For static configuration, you may have to add the configuration to
# KDUMP_COMMANDLINE_APPEND.
#
# By default, network is set up only if needed. Add ":force" to make sure
# that network is always available (e.g. for use by a custom pre- or
# post-script).
#
# See also: kdump(5)
#
KDUMP_NETCONFIG="auto"

## Type:        integer
## Default:     30
## ServiceRestart:      kdump
#
# Timeout for network changes. Kdumptool gives up waiting for a working
# network setup when it does not change for the given number of seconds.
#
# See also: kdump(5)
#
KDUMP_NET_TIMEOUT=30

Considering https://github.com/openSUSE/kdump/blob/cb129d0284e58c591a70736822efb01fd586e5da/kdumptool/configuration.cc#L159 and https://github.com/openSUSE/kdump/blob/cb129d0284e58c591a70736822efb01fd586e5da/dracut/module-setup.sh#L32 we should not need to force-enable networking.
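If force-enabling networking did turn out to be necessary after all, the change would presumably look like the following in /etc/sysconfig/kdump, followed by restarting the kdump service so the initrd gets rebuilt. The ":force" suffix comes from the comment block quoted above; the device name and mode are placeholders:

```shell
# /etc/sysconfig/kdump — force network setup in the kdump initrd
# ("eth0" and "dhcp" are placeholders; see kdump(5))
KDUMP_NETCONFIG="eth0:dhcp:force"
```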


Tried it again on ow20 but regarding network config, only the following is logged when a dump is collected:

[   13.492309][  T303] ehci-pci 0000:00:12.2: irq 17, io mem 0xfe9fa800
         Starting Netwo[   13.507440][   T92] igb 0000:02:00.0: added PHC on eth0
rk Configuration[   13.513777][   T92] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
...
[   13.522856][   T92] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 0c:c4:7a:7a:77:36
[   13.531640][   T92] igb 0000:02:00.0: eth0: PBA No: FFFFFF-0FF
[   13.537624][   T92] igb 0000:02:00.0: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
…
[   13.979624][    T5] igb 0000:02:00.1: added PHC on eth1
[   13.984990][    T5] igb 0000:02:00.1: Intel(R) Gigabit Ethernet Network Connection
[   13.992694][    T5] igb 0000:02:00.1: eth1: (PCIe:2.5Gb/s:Width x4) 0c:c4:7a:7a:77:37
[   14.000797][    T5] igb 0000:02:00.1: eth1: PBA No: FFFFFF-0FF
[   14.006729][    T5] igb 0000:02:00.1: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
[   14.075281][  T303] hub 2-0:1.0: USB hub found
[  OK  ] Started Network Configuration.
[   14.087014][  T303] hub 2-0:1.0: 6 ports detected
[  OK  ] Reached target Network.

But the following lines from the normal boot are missing (except [ OK ] Reached target Network. which apparently doesn't mean much):

…
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
…
openqaworker12 login: [  212.237682][    T5] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  212.338506][    T5] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  212.363041][    C4] clocksource: timekeeping watchdog on CPU4: hpet wd-wd read-back delay of 709866ns
[  212.428398][    C4] clocksource: wd-tsc-wd read-back delay of 354933ns, clock-skew test skipped!
[  212.517651][ T3310] NET: Registered PF_PACKET protocol family

At least the dump appears under /var/crash/ as expected.


I'm also wondering whether our salt setup is still missing something. At least Yast2 still thinks kdump is actually disabled, and I don't see where we would trigger the generation of a new initrd to apply changes made to /etc/sysconfig/kdump. (Considering the scripts in /usr/lib/dracut/modules.d/99kdump, some of these settings affect the initrd creation.) That's a bit weird because all Yast2 checks is whether the kdump service is enabled (see https://github.com/yast/yast-kdump/blob/06124bf3fed026364d504851b141b21552346c76/src/include/kdump/uifunctions.rb#L225) and it was definitely enabled.

I now enabled kdump additionally via Yast2 on ow20 and now it also appears as enabled there. When I wanted to cross-check with other workers I noticed that the newly introduced +=-syntax makes Yast2 unusable for parts where it needs to parse /etc/default/grub (e.g. kdump settings). Likely not a big deal since we use salt rather than Yast anyway; I only checked it to see whether our salt config is still missing something.


Looks like Yast2 rebuilds the initrd using these commands: update_command = (using_fadump? ? "/usr/sbin/mkdumprd -f" : "/sbin/mkinitrd")

#15 Updated by okurz about 2 months ago

mkittler wrote:

I'm also wondering whether our salt setup is still missing something. At least Yast2 still thinks kdump is actually disabled, and I don't see where we would trigger the generation of a new initrd to apply changes made to /etc/sysconfig/kdump. (Considering the scripts in /usr/lib/dracut/modules.d/99kdump, some of these settings affect the initrd creation.)

Hm, could be. I suggest experimenting with a simple VM, maybe JeOS-based?

EDIT: mkittler and I had a nice pair-work session and we tried out kdump and crashes on multiple machines: a fresh Leap JeOS VM configured by yast2 kdump, jenkins.qa.suse.de as well as openqaworker12. jenkins.qa.suse.de and openqaworker.suse.de showed the same problem: sending emails was not possible, with an error about an unreachable email server, likely because the network is not fully up within those machines. I also read https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-kexec.html and asked in https://suse.slack.com/archives/C029APBKLGK/p1665497912564259 but found no further helpful information on how to investigate or what to use as an alternative.

  1. mkittler, what was the result of the JeOS VM experiment, did it manage to send emails? If yes, we should look for differences; if no, I suggest trying out a different network configuration, e.g. a static IP, or NetworkManager or systemd-networkd instead of wicked.
  2. As an alternative we could create a custom systemd service that looks for the presence of files in /var/crash and fails if there are any, which would also trigger our monitoring. Or send an email from the rebooted production system as soon as such content is found. To prevent resending emails we should likely also move those files elsewhere, rename them, or put a separate marker file there to record that we already notified about them.
  3. Research best practices for detecting crashes and notifying admins about them.
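A minimal sketch of the check behind option 2, assuming a simple oneshot-style script (the function name is made up here; the actual implementation is whatever lands in salt-states-openqa):

```shell
#!/bin/sh
# check_crash_dir [DIR] — exit non-zero if DIR contains anything, so a
# systemd oneshot unit wrapping this script would show up as "failed"
# and thereby trigger our existing failed-services monitoring.
check_crash_dir() {
    dir="${1:-/var/crash}"
    [ -d "$dir" ] || return 0
    if [ -n "$(find "$dir" -mindepth 1 -print -quit 2>/dev/null)" ]; then
        echo "crash dumps found under $dir" >&2
        return 1
    fi
    return 0
}
```

The script could then be run periodically, e.g. from a systemd timer, with the unit's failure state feeding the alerting.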

#16 Updated by cdywan about 2 months ago

Alerts are triggering right now:

2022-10-12 09:15:00 openqaworker12 transactional-update 1
2022-10-11 15:21:20 jenkins kdump-early, kdump, systemd-udev-settle 3

#17 Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress

I would say this is actually "In Progress" if you agree.

#18 Updated by mkittler about 2 months ago

2022-10-11 15:21:20 jenkins kdump-early, kdump, systemd-udev-settle 3

Still from yesterday.

2022-10-12 09:15:00 openqaworker12 transactional-update 1

Still from yesterday, except transactional-update.


When receiving an alert, it makes sense to select the corresponding timeframe in the graph, which then also filters the table accordingly. Then you would see that only transactional-update on ow12 is still relevant. It doesn't seem to be necessarily related to this ticket but I'll have a look.

#19 Updated by cdywan about 2 months ago

Steps:

  1. File a bug
  2. Block the ticket on this bug
  3. Implement the custom systemd service

#20 Updated by mkittler about 2 months ago

And about the failing transactional-update service: there was a file conflict. Maybe it was caused by triggering a kernel panic at the wrong moment, leaving the file system in an inconsistent state. I've manually resolved the conflict and now transactional-update doesn't fail anymore when started.

#21 Updated by mkittler about 2 months ago

About the bug: I'm not sure whether the problems I saw in my tests with the JeOS VM were a product bug. Maybe suse.relay.de cannot be used that way. That the timeout for network availability is ignored, so ow12/jenkins cannot even reach the relay, is a clear bug, so I'll just report that.

#22 Updated by mkittler about 2 months ago

I filed the bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1204239

I'll continue with researching more best practices before implementing a custom systemd service. Not sure whether we should block the ticket on the bug.

#23 Updated by mkittler about 2 months ago

After installing/enabling/configuring postfix in the JeOS image sending kdump mails works from there.
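For reference, the mail-related settings live in /etc/sysconfig/kdump as well; a hedged sketch with placeholder values (variable names as documented in kdump(5); the actual server and recipient used on OSD are not reproduced here):

```shell
# /etc/sysconfig/kdump — mail notification settings (values are placeholders)
KDUMP_SMTP_SERVER="smtp.example.com"
KDUMP_NOTIFICATION_TO="admins@example.com"
```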

#24 Updated by mkittler about 2 months ago

I did a quick web search and couldn't find much about best practices. Every distribution has its own way of configuring kdump. The SUSE docs actually mention simply configuring mail settings like we did. Docs of other distros didn't even mention configuring notifications.

For further debugging I tried to enter a shell in the kdump environment but couldn't figure out how.

#25 Updated by okurz about 2 months ago

Next steps:

  • Try to reproduce the problem in the clean VM environment, e.g. configure 100+ TAP devices and see if that slows down wicked enough to break
  • Follow-up with a notification approach, e.g. as suggested in #116848#note-15

#26 Updated by mkittler about 1 month ago

Somehow I now managed to enter a shell in the kdump environment. The config from my previous attempts must have still been there and apparently it works after all. I actually wanted to test with 1000 tap devices. However, that test wasn't very helpful: I nevertheless got the DHCP lease within seconds, so the 1000 tap devices made no difference at all. In fact, when I type ip addr in the shell within kdump no tap devices are present at all¹, so it hasn't even picked up that config. Btw, when starting the regular (VM) system with 1000 tap devices I also didn't notice any delay; only the shutdown is a bit slower (waiting for wicked).

¹ The ethernet device is also called kdump0 in the kdump env, so the normal networking config is certainly not present there.

#27 Updated by mkittler about 1 month ago

Since I couldn't reproduce it in the VM (despite the 1000 tap devices) I've just created a SR for getting notified via a systemd service: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/760

It is still a draft but I've tested the service locally and on openqaworker12 (and it works). We currently have some regularly failing workers but I don't think we need to exclude them here as they're unable to invoke kdump anyway.

Since sudo salt -C '*' cmd.run 'find /var/crash -type d -empty | read' returned a red line for some machines, I'll move those dumps into a backup directory before we merge the SR.
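The `| read` idiom works because `read` succeeds only if `find` printed at least one line, i.e. only when an empty directory exists under (or including) /var/crash; machines with dumps produce no output and hence a non-zero (red) exit. Wrapped up as a standalone sketch of the same idiom (function name is made up):

```shell
#!/bin/sh
# has_empty_dir DIR — exit 0 iff find reports at least one empty directory
# under (or including) DIR, mirroring the salt one-liner above
has_empty_dir() {
    find "$1" -type d -empty | read -r _
}
```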

#28 Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

I see you've merged the SR. So I invoked sudo salt -C '*' cmd.run 'mkdir -p /var/crash-bak && mv --target-directory=/var/crash-bak /var/crash/*' to ensure no old dumps trigger our monitoring.

I've checked on some workers and it looks like the systemd file has been deployed and enabled correctly.

#29 Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

I suppose it can be considered resolved for now.
