action #62267
closed[sle][Migration][SLE15SP2][Regression] test fails in install_service - kdump lead system crash
100%
Description
Observation
openQA test in scenario sle-15-SP2-Regression-on-Migration-from-SLE12-SP5-to-SLE15-SP2-ppc64le-offline_sles12sp5_pscc_sdk-lp-asmm-contm-lgm-tcm-wsm_all_full@ppc64le fails in install_service
Test suite description
Reproducible
Fails since (at least) Build 104.1
Expected result
Last good: (unknown) (or more recent)
Further details
Always latest result in this scenario: latest
Updated by coolgw over 4 years ago
- Subject changed from [sle][Migration][SLE15SP2] test fails in install_service - cause exceed 2h timout of the job to [sle][Migration][SLE15SP2] test fails in install_service - kdump lead system crash
Updated by leli over 4 years ago
- Assignee set to hjluo
We can set a much larger value for QEMURAM to see whether that fixes this issue.
Updated by leli over 4 years ago
- Subject changed from [sle][Migration][SLE15SP2] test fails in install_service - kdump lead system crash to [sle][Migration][SLE15SP2][Regression] test fails in install_service - kdump lead system crash
Updated by hjluo about 4 years ago
It passed on build 134.1: https://openqa.nue.suse.com/tests/3853705
The failure looks like a kernel crash caused by accessing an invalid address:
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048
NUMA
pSeries
Modules linked in: fuse nf_log_ipv6 xt_comment nf_log_ipv4 nf_log_common xt_LOG xt_limit scsi_transport_iscsi af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables joydev vmx_crypto gf128mul crct10dif_vpmsum rtc_generic virtio_net net_failover failover kgraft_patch_1_8_5_1(OK) hid_generic usbhid btrfs xor zstd_decompress zstd_compress xxhash sr_mod cdrom raid6_pq virtio_blk virtio_scsi virtio_console bochs_drm drm_kms_helper crc32c_vpmsum syscopyarea sysfillrect sysimgblt fb_sys_fops ttm xhci_pci xhci_hcd drm usbcore drm_panel_orientation_quirks agpgart virtio_pci virtio_ring virtio sg dm_multipath dm_mod scsi_dh_rdac
scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: Yes
CPU: 0 PID: 4549 Comm: bash Tainted: G O K 4.12.14-122.12-default #1 SLE12-SP5
task: c000000038793a00 task.stack: c000000038ef4000
NIP: c0000000006754d8 LR: c000000000676660 CTR: c0000000006754a0
REGS: c000000038ef7960 TRAP: 0300 Tainted: G O K (4.12.14-122.12-default)
MSR: 8000000000009033
CR: 22424422 XER: 00000000
CFAR: c000000000002458 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c000000000676660 c000000038ef7be0 c0000000011d1f00 0000000000000063
GPR04: c00000003fe0ade8 c00000003fe22a00 7269676765722061 2063726173680d0a
GPR08: 0000000000000000 0000000000000001 0000000000000000 0000000000000172
GPR12: c0000000006754a0 c0000000076c0000 0000000010063538 0000010032ff7560
GPR16: 0000000000000000 0000000000000000 0000010032e239a0 00000000100f6c68
GPR20: 00000000100bf5b8 0000000000000000 00000000100f7848 0000010032ffd600
GPR24: 00000000100622e8 0000000000000000 0000000000000001 0000000000000000
GPR28: c0000000010c1d40 0000000000000063 c000000001091d0c c0000000010c2100
NIP [c0000000006754d8] sysrq_handle_crash+0x38/0x50
LR [c000000000676660] __handle_sysrq+0xf0/0x270
Call Trace:
[c000000038ef7be0] [c0000000009a88f4] printk+0x44/0x58 (unreliable)
[c000000038ef7c00] [c000000000676660] __handle_sysrq+0xf0/0x270
[c000000038ef7ca0] [c000000000676e1c] write_sysrq_trigger+0x6c/0xb0
[c000000038ef7cd0] [c00000000045829c] proc_reg_write+0x8c/0xd0
[c000000038ef7d00] [c0000000003aef8c] __vfs_write+0x4c/0x210
[c000000038ef7d90] [c0000000003b0e54] vfs_write+0xd4/0x250
[c000000038ef7de0] [c0000000003b2c6c] SyS_write+0x6c/0x110
[c000000038ef7e30] [c00000000000b188] system_call+0x3c/0x130
Instruction dump:
7c0802a6 f8010010 60000000 7c0802a6 f8010010 f821ffe1 39200001 3d42001c
394a1660 912a0000 7c0004ac 39400000 38210020 e8010010 7c0803a6
---[ end trace 29d4d031bd0a9f44 ]---
Sending IPI to other CPUs
IPI complete
kexec: Starting switchover sequence.
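In the trace above, NIP points into sysrq_handle_crash and the call chain goes through write_sysrq_trigger, so this oops is the crash requested through /proc/sysrq-trigger rather than an unrelated fault. A minimal sketch for checking a saved console log for that signature follows; the helper name is_sysrq_crash and the log file name in the usage note are assumptions, not part of the openQA test code.

```shell
# Sketch: decide whether an oops log shows the deliberate sysrq crash.
# is_sysrq_crash is a hypothetical helper; it reads a console log on stdin
# and succeeds when the faulting instruction (NIP) is in sysrq_handle_crash.
is_sysrq_crash() {
    grep -q '^NIP \[[0-9a-f]*] sysrq_handle_crash+'
}
```

Possible usage, assuming the serial log was saved as serial0.txt: is_sysrq_crash < serial0.txt && echo "crash was triggered via sysrq"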
Updated by hjluo about 4 years ago
./clone_job.sh 3797514 QEMURAM=4096
http://openqa.suse.de/t3857994
In this case it passed on build 126.1 and then failed at scc_registration, which is a known issue.
So we can increase QEMURAM to avoid this kind of system panic.
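The contents of clone_job.sh are not shown in this ticket. As a rough sketch, assuming it wraps the stock openqa-clone-job client, the equivalent call could be built as below. The helper name build_clone_cmd is hypothetical, and it only prints the command instead of running it.

```shell
# Sketch: print (not run) an openqa-clone-job command with a RAM override.
# build_clone_cmd is a hypothetical helper: $1 = source openQA host,
# $2 = job id to clone, $3 = QEMURAM value in MiB.
build_clone_cmd() {
    printf 'openqa-clone-job --from %s %s QEMURAM=%s\n' "$1" "$2" "$3"
}

# Values taken from the comment above.
build_clone_cmd https://openqa.suse.de 3797514 4096
```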
Updated by hjluo about 4 years ago
- % Done changed from 20 to 40
Solution: increase QEMURAM to avoid this kind of system panic.
http://openqa.nue.suse.com/tests/3857994#step/install_service/282
Updated by hjluo about 4 years ago
- % Done changed from 40 to 60
For this issue, we will just keep watching the coming builds. The kdump test only runs echo c > /proc/sysrq-trigger to trigger the core dump,
so we can't do much from the QA side.
Updated by hjluo about 4 years ago
- Related to action #61907: [kernel][ppc64le] Update kdump memory size for ppc64le added
Updated by rfan1 about 4 years ago
Whenever we trigger a kernel crash with the kdump service active:
# echo c > /proc/sysrq-trigger
the system panics, boots into the kdump ramdisk to save the core file, and then reboots.
It sometimes fails to save the core because of:
- out of memory (not enough memory was reserved to capture all the crash info)
- out of disk space (the disk is too small to hold the core file)
- a low baud rate configuration (console=ttyS0,9600)
In these cases we can modify the kdump configuration:
- adjust the kdump memory reservation, for example:
crashkernel=165M,high crashkernel=72M,low
- make sure there is enough disk space to save the core file (the default location is /var/crash)
- set the baud rate to 115200
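The three settings listed above can be inspected on a running system with standard tools. The following is only a sketch: the function names crashkernel_args and console_baud are hypothetical, and the script reads /proc/cmdline and /var/crash without changing any configuration.

```shell
# Sketch: report the kdump-related settings mentioned above.
# Function names are hypothetical; only standard tools are used.

# Print every crashkernel= argument found in a kernel command line string.
crashkernel_args() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^crashkernel=' || true
}

# Print the baud rate of the first console=<dev>,<baud> argument, if any.
console_baud() {
    printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^console=[^,]*,\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only inspect the running system when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    echo "crashkernel: $(crashkernel_args "$cmdline" | tr '\n' ' ')"
    echo "console baud: $(console_baud "$cmdline")"
    # Free space on the dump target; /var/crash is the default named above.
    df -h /var/crash 2>/dev/null || echo "/var/crash not present"
fi
```

An empty "crashkernel:" line would mean no memory is reserved for the capture kernel, which matches the out-of-memory case in the list.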
Updated by okurz about 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: online_sles15_pscc_basesys-srv-desk-dev-contm-lgm-wsm_all_full
https://openqa.suse.de/tests/3895073
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by hjluo about 4 years ago
Currently there are 2 issues:
1) In install_service the test times out waiting for the grub menu, mostly on ppc64le. We can increase the wait time to fix it.
https://openqa.nue.suse.com/tests/3959216#step/install_service/284
https://openqa.nue.suse.com/tests/3959208#step/install_service/282
2) On s390 it is a bug: the test cannot connect to the domain.
Bug 1163000 [Build 136.2] openQA test fails in kdump_and_crash - System does not come back after crash
https://bugzilla.suse.com/show_bug.cgi?id=1163000
https://openqa.nue.suse.com/tests/3958035#step/check_upgraded_service/1
https://openqa.nue.suse.com/tests/3958037#step/check_upgraded_service/1
https://openqa.nue.suse.com/tests/3958038#step/check_upgraded_service/197
Updated by hjluo about 4 years ago
- % Done changed from 60 to 80
Updated by hjluo about 4 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100