action #62267

closed

[sle][Migration][SLE15SP2][Regression] test fails in install_service - kdump lead system crash

Added by leli over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2020-01-19
Due date:
% Done: 100%
Estimated time: 10.00 h
Difficulty:

Description

Observation

openQA test in scenario sle-15-SP2-Regression-on-Migration-from-SLE12-SP5-to-SLE15-SP2-ppc64le-offline_sles12sp5_pscc_sdk-lp-asmm-contm-lgm-tcm-wsm_all_full@ppc64le fails in install_service

Test suite description

Reproducible

Fails since (at least) Build 104.1

Expected result

Last good: (unknown) (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Tests - action #61907: [kernel][ppc64le] Update kdump memory size for ppc64le | Resolved | pcervinka | 2020-01-08

Actions #1

Updated by coolgw over 4 years ago

  • Subject changed from [sle][Migration][SLE15SP2] test fails in install_service - cause exceed 2h timout of the job to [sle][Migration][SLE15SP2] test fails in install_service - kdump lead system crash
Actions #2

Updated by leli over 4 years ago

  • Assignee set to hjluo

We can set a much larger value for QEMURAM to see whether that fixes this issue.

Actions #3

Updated by leli over 4 years ago

  • Subject changed from [sle][Migration][SLE15SP2] test fails in install_service - kdump lead system crash to [sle][Migration][SLE15SP2][Regression] test fails in install_service - kdump lead system crash
Actions #4

Updated by hjluo about 4 years ago

  • Status changed from New to In Progress
Actions #5

Updated by leli about 4 years ago

Related to poo#61943.

Actions #6

Updated by leli about 4 years ago

  • Priority changed from Normal to High
Actions #7

Updated by hjluo about 4 years ago

It passed on build 134.1: https://openqa.nue.suse.com/tests/3853705.

It looks like a kernel crash caused by accessing an invalid address:
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: fuse nf_log_ipv6 xt_comment nf_log_ipv4 nf_log_common xt_LOG xt_limit scsi_transport_iscsi af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_pkttype xt_tcpudp iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack libcrc32c ip6table_filter ip6_tables x_tables joydev vmx_crypto gf128mul crct10dif_vpmsum rtc_generic virtio_net net_failover failover kgraft_patch_1_8_5_1(OK) hid_generic usbhid btrfs xor zstd_decompress zstd_compress xxhash sr_mod cdrom raid6_pq virtio_blk virtio_scsi virtio_console bochs_drm drm_kms_helper crc32c_vpmsum syscopyarea sysfillrect sysimgblt fb_sys_fops ttm xhci_pci xhci_hcd drm usbcore drm_panel_orientation_quirks agpgart virtio_pci virtio_ring virtio sg dm_multipath dm_mod scsi_dh_rdac
scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: Yes
CPU: 0 PID: 4549 Comm: bash Tainted: G O K 4.12.14-122.12-default #1 SLE12-SP5
task: c000000038793a00 task.stack: c000000038ef4000
NIP: c0000000006754d8 LR: c000000000676660 CTR: c0000000006754a0
REGS: c000000038ef7960 TRAP: 0300 Tainted: G O K (4.12.14-122.12-default)
MSR: 8000000000009033
CR: 22424422 XER: 00000000
CFAR: c000000000002458 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c000000000676660 c000000038ef7be0 c0000000011d1f00 0000000000000063
GPR04: c00000003fe0ade8 c00000003fe22a00 7269676765722061 2063726173680d0a
GPR08: 0000000000000000 0000000000000001 0000000000000000 0000000000000172
GPR12: c0000000006754a0 c0000000076c0000 0000000010063538 0000010032ff7560
GPR16: 0000000000000000 0000000000000000 0000010032e239a0 00000000100f6c68
GPR20: 00000000100bf5b8 0000000000000000 00000000100f7848 0000010032ffd600
GPR24: 00000000100622e8 0000000000000000 0000000000000001 0000000000000000
GPR28: c0000000010c1d40 0000000000000063 c000000001091d0c c0000000010c2100
NIP [c0000000006754d8] sysrq_handle_crash+0x38/0x50
LR [c000000000676660] __handle_sysrq+0xf0/0x270
Call Trace:
[c000000038ef7be0] [c0000000009a88f4] printk+0x44/0x58 (unreliable)
[c000000038ef7c00] [c000000000676660] __handle_sysrq+0xf0/0x270
[c000000038ef7ca0] [c000000000676e1c] write_sysrq_trigger+0x6c/0xb0
[c000000038ef7cd0] [c00000000045829c] proc_reg_write+0x8c/0xd0
[c000000038ef7d00] [c0000000003aef8c] __vfs_write+0x4c/0x210
[c000000038ef7d90] [c0000000003b0e54] vfs_write+0xd4/0x250
[c000000038ef7de0] [c0000000003b2c6c] SyS_write+0x6c/0x110
[c000000038ef7e30] [c00000000000b188] system_call+0x3c/0x130
Instruction dump:
7c0802a6 f8010010 60000000 7c0802a6 f8010010 f821ffe1 39200001 3d42001c
394a1660 912a0000 7c0004ac 39400000 38210020 e8010010 7c0803a6
---[ end trace 29d4d031bd0a9f44 ]---

Sending IPI to other CPUs
IPI complete
kexec: Starting switchover sequence.
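
To see which function the kernel was executing when it oopsed, one option is to pull the symbol out of the `NIP [...]` line of the log. This is only a minimal sketch: the helper name `faulting_symbol` is made up for illustration, and the pattern assumes the ppc64le oops layout shown above. In the trace above NIP points at sysrq_handle_crash, which suggests the oops is the crash requested through sysrq rather than an unrelated fault.

```shell
#!/bin/sh
# Print the symbol the kernel was executing when it oopsed.
# Reads an oops log on stdin and looks for a line such as:
#   NIP [c0000000006754d8] sysrq_handle_crash+0x38/0x50
# (hypothetical helper, not part of any openQA tooling)
faulting_symbol() {
    sed -n 's/^NIP \[[0-9a-f]*] \([A-Za-z0-9_.]*\)+0x.*/\1/p' | head -n 1
}

# Example using the NIP line from the trace above.
printf '%s\n' 'NIP [c0000000006754d8] sysrq_handle_crash+0x38/0x50' | faulting_symbol
```

On a saved serial log the same function could be fed with `faulting_symbol < serial0.txt`, where the file name is an assumption about how the log was stored.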

Actions #8

Updated by hjluo about 4 years ago

  • % Done changed from 0 to 20
Actions #9

Updated by hjluo about 4 years ago

./clone_job.sh 3797514 QEMURAM=4096

In http://openqa.suse.de/t3857994 this run on build 126.1 got past install_service and failed only at scc_registration, which is a known issue.
So we can increase QEMURAM to avoid this kind of system panic.
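
clone_job.sh is a local wrapper that is not shown in this ticket. As a rough sketch of the same idea, assuming the stock openqa-clone-job client and its `--from HOST JOB_ID KEY=VALUE` form, the command with a RAM override could be built like this. The function name `clone_cmd` is made up for illustration, and it only prints the command instead of running it.

```shell
#!/bin/sh
# Build (but do not run) a clone command that overrides QEMURAM.
# Usage: clone_cmd JOB_ID RAM_MB [SOURCE_HOST]
# Assumption: openqa-clone-job is installed and accepts KEY=VALUE overrides
# after the job id. Print first, review, then run by hand.
clone_cmd() {
    job_id=$1
    ram_mb=$2
    host=${3:-https://openqa.suse.de}
    printf 'openqa-clone-job --from %s %s QEMURAM=%s\n' "$host" "$job_id" "$ram_mb"
}

# The job id and RAM size used in the comment above.
clone_cmd 3797514 4096
```

Printing the command first makes it easy to compare a few QEMURAM values before scheduling any jobs.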

Actions #10

Updated by hjluo about 4 years ago

  • % Done changed from 20 to 40

Solution: increase QEMURAM to avoid this kind of system panic.
http://openqa.nue.suse.com/tests/3857994#step/install_service/282

Actions #11

Updated by hjluo about 4 years ago

Actions #12

Updated by hjluo about 4 years ago

  • % Done changed from 40 to 60

For this issue, we will just keep watching it on the upcoming builds. The kdump test just runs echo c > /proc/sysrq-trigger to trigger the core dump,
so we can't do much from the QA side.

Actions #13

Updated by hjluo about 4 years ago

  • Related to action #61907: [kernel][ppc64le] Update kdump memory size for ppc64le added
Actions #14

Updated by rfan1 about 4 years ago

Whenever we trigger a kernel crash with the kdump service active:
# echo c > /proc/sysrq-trigger

the system panics, boots into the kdump ramdisk to save the core file, and then reboots.
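
Before triggering the crash it can help to confirm that a crash kernel is actually loaded, otherwise the panic cannot hand over to kdump. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded; the function names and the REALLY_CRASH switch are made up for illustration, and without that switch the script only reports what it would do.

```shell
#!/bin/sh
# Return 0 only if a crash kernel is loaded, i.e. the flag file contains "1".
# The path is a parameter so the check can be tried against a scratch file.
crash_kernel_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Trigger the crash only when a crash kernel is loaded and REALLY_CRASH=1
# is set (hypothetical safety switch); otherwise just print a dry-run note.
trigger_crash() {
    if ! crash_kernel_loaded "$@"; then
        echo "no crash kernel loaded, not triggering" >&2
        return 1
    fi
    if [ "${REALLY_CRASH:-0}" = "1" ]; then
        echo c > /proc/sysrq-trigger   # panics the machine immediately
    else
        echo "dry run: would write 'c' to /proc/sysrq-trigger"
    fi
}
```

Calling `trigger_crash` as root with REALLY_CRASH=1 crashes the machine on purpose, so it belongs only on a disposable test system.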

Sometimes it fails to save the core due to:

  1. out of memory (we didn't reserve enough memory to capture all the crash info)
  2. out of disk space (the disk is too small to hold the core file)
  3. low baud rate configuration (console=ttyS0,9600)

In these cases, we can adjust our kdump configuration:

  1. Adjust the kdump memory reservation (the values can be tuned):

crashkernel=165M,high crashkernel=72M,low

  2. Make sure there is enough disk space to save the core file (the default location is /var/crash).

  3. Set the baud rate to 115200.
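
The three settings above can be inspected from a shell before the crash is triggered. The following is only a sketch: the function names are made up for illustration, the sample command line (including root=/dev/vda2) is hypothetical, and on a live system the string would come from /proc/cmdline instead.

```shell
#!/bin/sh
# Print every crashkernel= value found in a kernel command line, one per line.
crashkernel_values() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Print the baud rate of the first console=ttyS<N>,<baud> entry, if any.
console_baud() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^console=ttyS[0-9]*,\([0-9]*\).*/\1/p' | head -n 1
}

# Print free space in MiB on the filesystem holding the dump directory.
dump_dir_free_mb() {
    df -Pk "${1:-/var/crash}" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Hypothetical command line; use "$(cat /proc/cmdline)" on a real system.
cmdline='root=/dev/vda2 crashkernel=165M,high crashkernel=72M,low console=ttyS0,9600'
crashkernel_values "$cmdline"
console_baud "$cmdline"
```

If `console_baud` reports 9600, the serial console is a likely bottleneck while the dump is written, and `dump_dir_free_mb /var/crash` gives a quick figure to compare against the expected core size.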

Actions #15

Updated by okurz about 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: online_sles15_pscc_basesys-srv-desk-dev-contm-lgm-wsm_all_full
https://openqa.suse.de/tests/3895073

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #16

Updated by leli about 4 years ago

  • Priority changed from High to Urgent
Actions #18

Updated by hjluo about 4 years ago

Currently there are 2 issues:
1) The test times out waiting for the grub menu in install_service; we can increase the wait time to fix it.

This happens mostly on ppc64le:
https://openqa.nue.suse.com/tests/3959216#step/install_service/284
https://openqa.nue.suse.com/tests/3959208#step/install_service/282

2) On s390 it is a bug: the test can't connect to the domain.
Bug 1163000 [Build 136.2] openQA test fails in kdump_and_crash - System does not come back after crash
https://bugzilla.suse.com/show_bug.cgi?id=1163000
https://openqa.nue.suse.com/tests/3958035#step/check_upgraded_service/1
https://openqa.nue.suse.com/tests/3958037#step/check_upgraded_service/1
https://openqa.nue.suse.com/tests/3958038#step/check_upgraded_service/197

Actions #20

Updated by hjluo about 4 years ago

  • % Done changed from 80 to 90

PR merged

Actions #21

Updated by hjluo about 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100