action #166520

closed

[alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S

Added by okurz 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-09-09
Due date:
2024-09-24
% Done:

0%

Estimated time:

Description

Observation

qesapworker-prg5 does not seem to have come back from a reboot during the weekly maintenance window:
https://monitor.qa.suse.de/d/WDqesapworker-prg5/worker-dashboard-qesapworker-prg5?viewPanel=65105&orgId=1&from=1725721856750&to=1725820246008

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) - Resolved - okurz - 2024-03-18

Related to openQA Infrastructure (public) - action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M - Resolved - ybonatakis

Actions #1

Updated by nicksinger 3 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

There was no output after connecting via SOL, and the system also didn't react to SysRq requests over SOL, so I had to power-cycle the machine. I am currently observing the boot process to see what is broken, and after some time I get:

[  OK  ] Mounted /boot/efi.
[   28.413545][   T19] intel_rapl_common: Found RAPL domain package
[   28.414306][ T1851] tg3 0000:04:00.0 em1: renamed from eth2
[   28.420668][   T19] intel_rapl_common: Found RAPL domain dram
[   28.433182][   T19] intel_rapl_common: DRAM domain energy unit 15300pj
[   28.433544][   T20] intel_rapl_common: Found RAPL domain package
[  OK  ] Finished Fail if at least …een recorded under /var/crash.
[   28.437020][ T2330] ------------[ cut here ]------------
[   28.437026][ T2330] WARNING: CPU: 42 PID: 2330 at ../block/blk-lib.c:50 __blkdev_issue_discard+0x182/0x1e0
[   28.437040][ T2330] Modules linked in: intel_rapl_msr(+) intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp bnxt_re ipmi_ssif ib_uverbs nls_iso8859_1 mgag200 dell_wmi nls_cp437 iTCO_wdt i2c_algo_bit ledtrig_audio intel_pmc_bxt ib_core vfat sparse_keymap drm_shmem_helper iTCO_vendor_support mfd_core drm_kms_helper kvm_intel rfkill fat dell_smbios raid0 syscopyarea video intel_th_gth(X) mei_me sysfillrect acpi_ipmi md_mod ipmi_si kvm dcdbas(X) tg3 intel_th_pci(X) isst_if_mmio i2c_i801 wmi_bmof sysimgblt isst_if_mbox_pci irqbypass pcspkr dell_wmi_descriptor ipmi_devintf libphy fb_sys_fops mei i2c_smbus isst_if_common intel_vsec intel_pch_thermal intel_th(X) bnxt_en ipmi_msghandler button fuse efi_pstore(N) drm configfs ip_tables x_tables sd_mod t10_pi crc64_rocksoft_generic xhci_pci crc64_rocksoft xhci_pci_renesas crc64 crc32_pclmul ghash_clmulni_intel xhci_hcd aesni_intel ahci libahci
[   28.437182][ T2330]  crypto_simd cryptd libata usbcore megaraid_sas wmi btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs
[   28.437217][ T2330] Supported: No, Unsupported modules are loaded
[   28.437220][ T2330] CPU: 42 PID: 2330 Comm: mkfs.ext2 Tainted: G               X  N 5.14.21-150500.55.73-default #1 SLE15-SP5 fc1a4bbca9a564948b2766e2916ea63844b7cf3f
[   28.437225][ T2330] Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.10.2 03/03/2023
[   28.437229][ T2330] RIP: 0010:__blkdev_issue_discard+0x182/0x1e0
[   28.437233][ T2330] Code: 08 4c 89 28 31 c0 e9 f8 fe ff ff c1 e8 09 4c 89 f9 8d 70 ff 4c 09 f1 b8 ea ff ff ff 48 85 ce 0f 84 3e ff ff ff e9 d9 fe ff ff <0f> 0b 48 8d 74 24 10 e8 12 d3 00 00 48 c7 c6 60 64 cb b6 48 c7 c7
[   28.437237][ T2330] RSP: 0018:ff7982535b557d40 EFLAGS: 00010246
[   28.437242][ T2330] RAX: ff2c0b8e1367a0a0 RBX: ff2c0b8e16d91f40 RCX: ff2c0b8e16d91f40
[   28.437246][ T2330] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff2c0b8e16d91f40
[   28.437248][ T2330] RBP: ff2c0b8e16d91f40 R08: ff7982535b557db0 R09: 0000000000000000
[   28.437251][ T2330] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000008
[   28.437254][ T2330] R13: 0000000000000cc0 R14: 0000000000000008 R15: 00000000480e009f
[   28.437257][ T2330] FS:  00007fa2a80da780(0000) GS:ff2c0b8bbf940000(0000) knlGS:0000000000000000
[   28.437261][ T2330] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.437264][ T2330] CR2: 0000563522715ba8 CR3: 000000011cff2001 CR4: 0000000000771ee0
[   28.437267][ T2330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.437269][ T2330] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   28.437272][ T2330] PKRU: 55555554
[   28.437275][ T2330] Call Trace:
[   28.437281][ T2330]  <TASK>
[   28.437287][ T2330]  blkdev_issue_discard+0x51/0xb0
[   28.437292][ T2330]  blkdev_common_ioctl+0x394/0x960
[   28.437298][ T2330]  blkdev_ioctl+0x21c/0x2a0
[   28.437301][ T2330]  __x64_sys_ioctl+0x8f/0xd0
[   28.437310][ T2330]  do_syscall_64+0x58/0x80
[   28.437318][ T2330]  ? exc_page_fault+0x67/0x150
[   28.437323][ T2330]  entry_SYSCALL_64_after_hwframe+0x6b/0xd5
[   28.437332][ T2330] RIP: 0033:0x7fa2a81ea527
[   28.437336][ T2330] Code: 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 d9 0d 00 f7 d8 64 89 01 48
[   28.437339][ T2330] RSP: 002b:00007fffc8f69a58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   28.437344][ T2330] RAX: ffffffffffffffda RBX: 0000563522709bb0 RCX: 00007fa2a81ea527
[   28.437347][ T2330] RDX: 00007fffc8f69a60 RSI: 0000000000001277 RDI: 0000000000000003
[   28.437350][ T2330] RBP: 00007fa2a83ce500 R08: 0000000000000003 R09: 0000000000000000
[   28.437353][ T2330] R10: 00007fa2a8367a80 R11: 0000000000000246 R12: 0000000000000000
[   28.437355][ T2330] R13: 00000000df857f00 R14: 0000000000000000 R15: 0000000000000000
[   28.437359][ T2330]  </TASK>
[   28.437361][ T2330] ---[ end trace d326646dab372ce6 ]---
[   28.437365][ T2330] md127: Error: discard_granularity is 0.
[   28.875868][   T20] intel_rapl_common: Found RAPL domain dram
[   28.875873][   T20] intel_rapl_common: DRAM domain energy unit 15300pj

However, this seems to be just a warning, and the system continues booting successfully. Maybe this host has a problem similar to the one in https://progress.opensuse.org/issues/166169? Checking whether I can reproduce it.
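
To back up the "just a warning" reading, console output like the log above can be triaged mechanically. The following is only a minimal sketch: the function name `triage` and the list of fatal markers are my own assumptions and not part of any existing tooling. It searches for the kernel timestamp prefix anywhere in a line, so it also tolerates lines where systemd output got interleaved with kernel messages.

```python
import re

# Kernel console prefix as seen in the log above, e.g. "[   28.437026][ T2330] ".
# The opening "[" is not required, so interleaved console output still matches.
PREFIX = re.compile(r"(?P<ts>\d+\.\d+)\]\[\s*(?P<task>[TC]\d+)\]\s?(?P<msg>.*)")

# Assumption: markers that usually point to a fatal problem rather than a WARN splat.
FATAL_MARKERS = ("Kernel panic", "BUG:", "Oops:")


def triage(lines):
    """Return (warnings, fatals) as lists of (timestamp, message) tuples."""
    warnings, fatals = [], []
    for line in lines:
        match = PREFIX.search(line)
        if match is None:
            continue  # e.g. plain systemd status lines without a kernel timestamp
        timestamp = float(match.group("ts"))
        msg = match.group("msg")
        if msg.startswith("WARNING:"):
            warnings.append((timestamp, msg))
        elif any(marker in msg for marker in FATAL_MARKERS):
            fatals.append((timestamp, msg))
    return warnings, fatals


# Three lines taken from the boot log in this ticket.
sample = [
    "[  OK  ] Mounted /boot/efi.",
    "[   28.437026][ T2330] WARNING: CPU: 42 PID: 2330 at ../block/blk-lib.c:50 __blkdev_issue_discard+0x182/0x1e0",
    "[   28.437365][ T2330] md127: Error: discard_granularity is 0.",
]
warnings, fatals = triage(sample)
print(f"{len(warnings)} warning(s), {len(fatals)} fatal message(s)")
```

An empty `fatals` list does not prove the boot is healthy; it only shows that none of the assumed markers appeared in the captured output.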

Actions #2

Updated by openqa_review 3 months ago

  • Due date set to 2024-09-24

Setting due date based on mean cycle time of SUSE QE Tools

Actions #3

Updated by livdywan 3 months ago

  • Subject changed from [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) to [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S
Actions #4

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

30 out of 30 reboots were successful. The alert fired right after I conducted my experiment, so it was a false positive.
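
For reference, a repeated-reboot check of this kind could be scripted roughly as follows. This is a hedged sketch and not the procedure actually used for the 30 reboots: `count_successful_reboots`, `trigger_reboot` and `is_up` are hypothetical names, and the two callables would have to be supplied by the caller (for example, wrappers around IPMI or SSH commands).

```python
import time


def count_successful_reboots(trigger_reboot, is_up, runs=30, timeout=600.0,
                             poll_interval=5.0, sleep=time.sleep,
                             clock=time.monotonic):
    """Reboot a host `runs` times and count how often it comes back in time.

    trigger_reboot and is_up are caller-supplied callables (hypothetical
    here). The loop stops at the first failed reboot so that the host stays
    in the broken state for inspection.
    """
    successes = 0
    for _ in range(runs):
        trigger_reboot()
        deadline = clock() + timeout
        came_back = False
        while clock() < deadline:
            # Caveat: a real is_up() should make sure the host went down
            # first, e.g. by comparing boot IDs, otherwise a slow shutdown
            # is counted as a successful reboot.
            if is_up():
                came_back = True
                break
            sleep(poll_interval)
        if not came_back:
            break
        successes += 1
    return successes


# Simulated host that always comes back immediately; nothing is rebooted here.
result = count_successful_reboots(lambda: None, lambda: True, runs=30)
print(f"{result}/30 reboots were successful")
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the loop testable without waiting for real timeouts.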

Actions #5

Updated by okurz 3 months ago

  • Related to action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) added
Actions #6

Updated by okurz 3 months ago

  • Related to action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M added