action #116848
closed: Ensure kdump is enabled and working on all OSD machines
Added by nicksinger about 2 years ago. Updated about 2 years ago.
Description
Observation
With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/733 we introduce automatic reboots on all machines if the kernel panics. This could hide away problems. Therefore we should make sure to enable kdump on all machines to get reports if the kernel panics and fix the underlying problem. Rebooting should just be a workaround, not the solution.
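For context, a minimal sketch of how such an automatic reboot on panic is commonly configured (I have not checked the MR's actual contents; the sysctl value below is an assumption):
# Reboot automatically 90 seconds after a kernel panic (value is an example)
echo 'kernel.panic = 90' | sudo tee /etc/sysctl.d/90-panic-reboot.conf
sudo sysctl --system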
Acceptance Criteria
- AC1: kdump is enabled on all machines affected by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/733
- AC1.1: kdump config is managed by salt
- AC1.2: implement some notification channel for the crash dumps
Updated by okurz about 2 years ago
- Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Updated by okurz about 2 years ago
- Target version set to Ready
Fully agreeing. I already mentioned enabling kdump in #116722.
Updated by okurz about 2 years ago
- Status changed from New to Feedback
- Assignee set to okurz
Updated by okurz about 2 years ago
- Status changed from Feedback to New
- Assignee deleted (okurz)
MR merged. Testing out kdump by triggering a crash. On openqaworker12 I did echo c > /proc/sysrq-trigger. On IPMI SOL we could monitor that the crashkernel boots and writes dump and log files to /kdump/mnt/var/crash/, so we should be able to find the files later in the root file system under /var/crash. But then an error occurred:
INFO: Email failed: SMTP server not reachable
We could not find helpful messages in the logfile that would tell us where the problem lies, so email sending needs to be investigated.
Also, after reboot the kdump services fail, e.g. on backup.qa.suse.de, because the crashkernel boot parameter is missing. Likely it was never enabled on installation. But we can move the corresponding part in salt that sets this boot parameter into the kdump state as well.
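Roughly what the salt-managed state boils down to on an affected host, as a manual sketch (the 210M value matches what we end up using later in this ticket; paths are the usual openSUSE ones):
# Prepend crashkernel=210M to the default kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&crashkernel=210M /' /etc/default/grub
# Regenerate the GRUB configuration and make sure kdump starts on boot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo systemctl enable kdump
# The memory reservation only takes effect after the next reboot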
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
- Assignee set to mkittler
Now related systemd units are failing:
martchus@openqa:~> sudo systemctl status kdump
× kdump.service - Load kdump kernel and initrd
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2022-10-09 03:32:57 CEST; 1 day 8h ago
Process: 1184 ExecStart=/usr/lib/kdump/load.sh --update (code=exited, status=1/FAILURE)
Main PID: 1184 (code=exited, status=1/FAILURE)
Okt 09 03:32:29 openqa systemd[1]: Starting Load kdump kernel and initrd...
Okt 09 03:32:57 openqa load.sh[1184]: Memory for crashkernel is not reserved
Okt 09 03:32:57 openqa load.sh[1184]: Please reserve memory by passing "crashkernel=X@Y" parameter to kernel
Okt 09 03:32:57 openqa load.sh[1184]: Then try to loading kdump kernel
Okt 09 03:32:57 openqa load.sh[1184]: Memory for crashkernel is not reserved
Okt 09 03:32:57 openqa load.sh[1184]: Please reserve memory by passing "crashkernel=X@Y" parameter to kernel
Okt 09 03:32:57 openqa load.sh[1184]: Then try to loading kdump kernel
Okt 09 03:32:57 openqa systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Okt 09 03:32:57 openqa systemd[1]: kdump.service: Failed with result 'exit-code'.
Okt 09 03:32:57 openqa systemd[1]: Failed to start Load kdump kernel and initrd.
I've been resetting those units for now.
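A quick way to check whether a host is affected, i.e. whether the running kernel actually reserved crashkernel memory (just a verification sketch):
# Is the parameter on the kernel command line at all?
grep -o 'crashkernel=[^ ]*' /proc/cmdline || echo "no crashkernel parameter"
# Did the kernel reserve memory? 0 means nothing is reserved
cat /sys/kernel/kexec_crash_size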
Updated by mkittler about 2 years ago
Updated by openqa_review about 2 years ago
- Due date set to 2022-10-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
It still works on the workers as before with the updated /etc/default/grub and /boot/grub2/grub.cfg, e.g.:
Okt 11 09:44:41 QA-Power8-5-kvm kernel: Reserving 210MB of memory at 128MB for crashkernel (System RAM: 262144MB)
I cannot check other hosts besides OSD right now as they're currently in maintenance, but /etc/default/grub and /boot/grub2/grub.cfg look good on OSD.
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
Also looks good on other hosts:
Okt 11 10:49:20 monitor kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.21-default root=UUID=19732107-332c-4bef-8433-158534a48955 resume=/dev/vda1 splash=silent quiet showopts crashkernel=210M
…
Okt 11 10:49:20 monitor kernel: Reserving 210MB of memory at 2848MB for crashkernel (System RAM: 16383MB)
martchus@openqa-monitor:~> systemctl status kdump
● kdump.service - Load kdump kernel and initrd
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
Active: active (exited) since Tue 2022-10-11 10:51:13 CEST; 7min ago
Okt 11 10:49:26 backup-vm kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.21-default root=/dev/mapper/system-root resume=/dev/system/swap splash=silent quiet showopts crashkernel=210M
…
Okt 11 10:49:26 backup-vm kernel: Reserving 210MB of memory at 2848MB for crashkernel (System RAM: 4095MB)
martchus@backup-vm:~> systemctl status kdump
● kdump.service - Load kdump kernel and initrd
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: disabled)
Active: active (exited) since Tue 2022-10-11 10:51:38 CEST; 8min ago
Not sure whether we might need to introduce an override to configure different crashkernel sizes, as the system RAM differs between hosts.
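If we do need per-host sizes, one option could be to let kdumptool suggest a reservation instead of hard-coding 210M everywhere (a sketch; whether and how to feed this into salt would still need to be decided):
# Print recommended crashkernel reservation values for this host
sudo kdumptool calibrate
# The output could then be used for a per-host override of the crashkernel parameter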
Updated by okurz about 2 years ago
- Status changed from Resolved to Feedback
Good. I suggest you rerun the experiment from #116848#note-6 on a non-worker host to see whether both the crashkernel is loaded correctly and kdump email sending works.
Updated by mkittler about 2 years ago
Oh, I've overlooked AC1.2. I suppose sending an email to osd-admins would make sense. However, considering #116848#note-6, the network setup in the kdump kernel environment is likely not sufficient.
Updated by mkittler about 2 years ago
Maybe we need to change networking-related variables:
# Network configuration. Use "auto" for auto-detection in initrd, or a string
# that contains the network device and the mode (static, dhcp, dhcp6, auto6),
# separated by a colon. Example: "eth0:static" or "eth1:dhcp".
#
# For static configuration, you may have to add the configuration to
# KDUMP_COMMANDLINE_APPEND.
#
# By default, network is set up only if needed. Add ":force" to make sure
# that network is always available (e.g. for use by a custom pre- or
# post-script).
#
# See also: kdump(5)
#
KDUMP_NETCONFIG="auto"
## Type: integer
## Default: 30
## ServiceRestart: kdump
#
# Timeout for network changes. Kdumptool gives up waiting for a working
# network setup when it does not change for the given number of seconds.
#
# See also: kdump(5)
#
KDUMP_NET_TIMEOUT=30
Considering https://github.com/openSUSE/kdump/blob/cb129d0284e58c591a70736822efb01fd586e5da/kdumptool/configuration.cc#L159 and https://github.com/openSUSE/kdump/blob/cb129d0284e58c591a70736822efb01fd586e5da/dracut/module-setup.sh#L32 we should not need to force-enable networking.
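For completeness, if force-enabling networking did turn out to be necessary after all, the change would look roughly like this (a sketch; the interface/mode are assumptions, and the kdump initrd has to be rebuilt for it to take effect):
# Force network setup in the kdump environment (adjust interface/mode per host)
sudo sed -i 's/^KDUMP_NETCONFIG=.*/KDUMP_NETCONFIG="eth0:dhcp:force"/' /etc/sysconfig/kdump
# Rebuild the kdump initrd and reload the crash kernel
sudo mkdumprd -f
sudo systemctl restart kdump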
Tried it again on ow20, but regarding the network config only the following is logged when a dump is collected:
[ 13.492309][ T303] ehci-pci 0000:00:12.2: irq 17, io mem 0xfe9fa800
Starting Netwo[ 13.507440][ T92] igb 0000:02:00.0: added PHC on eth0
rk Configuration[ 13.513777][ T92] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
...
[ 13.522856][ T92] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x4) 0c:c4:7a:7a:77:36
[ 13.531640][ T92] igb 0000:02:00.0: eth0: PBA No: FFFFFF-0FF
[ 13.537624][ T92] igb 0000:02:00.0: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
…
[ 13.979624][ T5] igb 0000:02:00.1: added PHC on eth1
[ 13.984990][ T5] igb 0000:02:00.1: Intel(R) Gigabit Ethernet Network Connection
[ 13.992694][ T5] igb 0000:02:00.1: eth1: (PCIe:2.5Gb/s:Width x4) 0c:c4:7a:7a:77:37
[ 14.000797][ T5] igb 0000:02:00.1: eth1: PBA No: FFFFFF-0FF
[ 14.006729][ T5] igb 0000:02:00.1: Using MSI-X interrupts. 1 rx queue(s), 1 tx queue(s)
[ 14.075281][ T303] hub 2-0:1.0: USB hub found
[ OK ] Started Network Configuration.
[ 14.087014][ T303] hub 2-0:1.0: 6 ports detected
[ OK ] Reached target Network.
But the following lines from the normal boot are missing (except [ OK ] Reached target Network. which apparently doesn't mean much):
…
[ OK ] Reached target Network.
[ OK ] Reached target Network is Online.
…
openqaworker12 login: [ 212.237682][ T5] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 212.338506][ T5] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 212.363041][ C4] clocksource: timekeeping watchdog on CPU4: hpet wd-wd read-back delay of 709866ns
[ 212.428398][ C4] clocksource: wd-tsc-wd read-back delay of 354933ns, clock-skew test skipped!
[ 212.517651][ T3310] NET: Registered PF_PACKET protocol family
At least the dump appears under /var/crash/ as expected.
I'm also wondering whether our salt setup is still missing something. At least Yast2 still thinks kdump is actually disabled, and I don't see where we would trigger the generation of a new initrd to apply changes made to /etc/sysconfig/kdump. (Considering the scripts in /usr/lib/dracut/modules.d/99kdump, some of these settings affect the initrd creation.) That's a bit weird because all Yast2 checks is whether the kdump service is enabled (see https://github.com/yast/yast-kdump/blob/06124bf3fed026364d504851b141b21552346c76/src/include/kdump/uifunctions.rb#L225) and it was definitely enabled. I now enabled kdump additionally via Yast2 on ow20 and now it also appears as enabled there. When I wanted to cross-check with other workers I noticed that the newly introduced += syntax makes Yast2 unusable for parts where it needs to parse /etc/default/grub (e.g. kdump settings). Likely not a big deal since we use salt rather than Yast anyway; I just checked it to see whether our salt config still misses something.
Looks like Yast2 rebuilds the initrd using one of these commands: update_command = (using_fadump? ? "/usr/sbin/mkdumprd -f" : "/sbin/mkinitrd")
Updated by okurz about 2 years ago
mkittler wrote:
I'm also wondering whether our salt setup is still missing something. At least Yast2 still thinks kdump is actually disabled and I don't see where we would trigger the generation of a new initrd to apply changes made to /etc/sysconfig/kdump. (Considering the scripts in /usr/lib/dracut/modules.d/99kdump, some of these settings affect the initrd creation.)
Hm, could be. I suggest experimenting with a simple VM, maybe JeOS-based?
EDIT: mkittler and I had a nice pair-working session and tried out kdump and crashes on multiple machines: a fresh Leap JeOS VM configured by yast2 kdump, jenkins.qa.suse.de as well as openqaworker12. jenkins.qa.suse.de and openqaworker.suse.de showed the same problem that sending emails was not possible, something about an unreachable email server, likely due to the network not being fully reachable within the machines. I also read https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-kexec.html and asked in https://suse.slack.com/archives/C029APBKLGK/p1665497912564259 but found no more helpful information on how to investigate or what to use as an alternative.
- @mkittler what was the result of the JeOS VM experiment, did it manage to send emails? If yes, we should look for differences; if no, I suggest trying out a different network configuration, e.g. a static IP or NetworkManager or systemd-networkd instead of wicked.
- As an alternative we could create a custom systemd service that looks for the presence of files in /var/crash and fails if there are any, which will also trigger our monitoring. Or send an email from the rebooted production system as soon as such content is found. To prevent resending emails we should likely also move those files elsewhere, rename them, or put a separate marker file there to record that we already notified about them (see the sketch after this list).
- Research more best practices on how to detect crashes and notify admins about them
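A hypothetical sketch of such a check service (the unit name and the exact check are assumptions, not necessarily what a salt MR would implement):
# Create a oneshot unit that fails whenever /var/crash is not empty, so our
# existing failed-systemd-services alert picks it up
sudo tee /etc/systemd/system/check-for-kernel-crash-dumps.service >/dev/null <<'EOF'
[Unit]
Description=Fail if kernel crash dumps are present under /var/crash

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c '! find /var/crash -mindepth 1 | grep -q .'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now check-for-kernel-crash-dumps.service
Running it periodically (e.g. via a timer) and moving handled dumps away would cover the "do not re-notify" part.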
Updated by livdywan about 2 years ago
The alert is triggering right now:
2022-10-12 09:15:00 openqaworker12 transactional-update 1
2022-10-11 15:21:20 jenkins kdump-early, kdump, systemd-udev-settle 3
Updated by okurz about 2 years ago
- Status changed from Feedback to In Progress
I would say this is actually "In Progress" if you agree.
Updated by mkittler about 2 years ago
2022-10-11 15:21:20 jenkins kdump-early, kdump, systemd-udev-settle 3
Still from yesterday.
2022-10-12 09:15:00 openqaworker12 transactional-update 1
Still from yesterday, except transactional-update.
When receiving an alert it makes sense to select the corresponding timeframe in the graph, which then also filters the table accordingly. Then you would see that only transactional-update on ow12 is still relevant. It doesn't seem to be necessarily related to this ticket, but I'll have a look.
Updated by livdywan about 2 years ago
Steps:
- File a bug
- Block the ticket on this bug
- Implement the custom systemd service
Updated by mkittler about 2 years ago
And about the failing transactional-update service: there was a file conflict. Maybe it was caused by triggering a kernel panic at the wrong moment, leaving the file system in an inconsistent state. I've manually resolved the conflict and now transactional-update doesn't fail anymore when starting it.
Updated by mkittler about 2 years ago
About the bug: I'm not sure whether the problems I saw in my tests with the JeOS VM are a product bug. Maybe suse.relay.de cannot be used that way. But the fact that the timeout for network availability is ignored, so that ow12/jenkins cannot even reach the relay, is a clear bug, so I'll just report that.
Updated by mkittler about 2 years ago
I filed the bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1204239
I'll continue with researching more best practices before implementing a custom systemd service. Not sure whether we should block the ticket on the bug.
Updated by mkittler about 2 years ago
After installing, enabling and configuring postfix in the JeOS image, sending kdump mails works from there.
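For reference, the mail-related variables in /etc/sysconfig/kdump (see kdump(5)) look roughly like this; the addresses/hostnames below are placeholders, not our actual values:
# Who gets notified and which relay to use (placeholders)
KDUMP_NOTIFICATION_TO="osd-admins@example.com"
KDUMP_SMTP_SERVER="relay.example.com"
# After editing, rebuild the kdump initrd so the change is picked up:
#   sudo mkdumprd -f && sudo systemctl restart kdump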
Updated by mkittler about 2 years ago
I did a quick web search and couldn't find much about best practices. Every distribution has its own way of configuring kdump. The SUSE docs actually just mention configuring mail settings like we did. The docs of other distros didn't even mention configuring notifications.
For further debugging I tried to enter a shell in the kdump environment but couldn't figure out how.
Updated by okurz about 2 years ago
Next steps:
- Try to reproduce the problem in a clean VM environment, e.g. configure 100+ TAP devices and see if that slows down wicked enough to break (see the sketch below)
- Follow-up with a notification approach, e.g. as suggested in #116848#note-15
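A rough sketch of the TAP-device experiment (device names and count are arbitrary):
# Create a bunch of tap devices to potentially slow down network setup
for i in $(seq 0 99); do
    sudo ip tuntap add dev tap$i mode tap
    sudo ip link set tap$i up
done
# Then trigger a crash as before
echo c | sudo tee /proc/sysrq-trigger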
Updated by mkittler about 2 years ago
Somehow I now managed to enter a shell in the kdump environment. The config from my previous attempts must have still been there and apparently it works after all. I actually wanted to test with 1000 tap devices. However, that test wasn't very helpful because I nevertheless got the DHCP lease within seconds and the 1000 tap devices don't make a difference at all. In fact, when I type ip addr in the shell within kdump no tap devices are present at all¹, so it hasn't even picked up that config. Btw, when starting the regular (VM) system with 1000 tap devices I also didn't notice any delay due to them; only the shutdown is a bit slower (waiting for wicked).
¹ The ethernet device is also called kdump0 in the kdump env, so in that env the normal networking config is certainly not present.
Updated by mkittler about 2 years ago
Since I couldn't reproduce it in the VM (despite 1000 tap devices) I've just created an SR for getting notified via a systemd service: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/760
It is still a draft but I've tested the service locally and on openqaworker12 (and it works). We currently have some regularly failing workers but I don't think we need to exclude them here as they're unable to invoke kdump anyway.
Since sudo salt -C '*' cmd.run 'find /var/crash -type d -empty | read' returned a red line for some machines, I'll move those dumps to a backup directory before we merge the SR.
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
I see you've merged the SR. So I invoked sudo salt -C '*' cmd.run 'mkdir -p /var/crash-bak && mv --target-directory=/var/crash-bak /var/crash/*' to ensure no old dumps trigger our monitoring.
I've checked on some workers and it looks like the systemd unit has been deployed and enabled correctly.
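A quick way to spot-check the deployment across all hosts without relying on the exact unit name (just a sketch):
sudo salt -C '*' cmd.run 'systemctl list-unit-files | grep -i crash'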
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
I suppose it can be considered resolved for now.