Project

General

Profile

Actions

action #164907

closed

[alert][FIRING:1] host_up (qesapworker-prg5: host up alert openQA, qesapworker-prg5-mgmt.qa.suse.cz not reachable, failing osd-deployment

Added by okurz 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-08-04
Due date:
% Done:

0%

Estimated time:

Description

Observation

Since
https://stats.openqa-monitor.qa.suse.de/d/WDqesapworker-prg5/worker-dashboard-qesapworker-prg5?orgId=1&viewPanel=65105&from=1722734477390&to=1722735702023
the weekly reboot maintenance window the machine did not come up again. No response on "sol activate". https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2907431 failed due to this.

Rollback steps

  • ssh osd "sudo salt-key -y -a qesapworker-prg5.qa.suse.cz && sudo salt --no-color --state-output=changes 'qesapworker-prg5*' state.apply | grep -v Result"

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz)Resolvedokurz2024-03-18

Actions
Actions #1

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 4 months ago

power reset helped to recover but systemctl --failed:

  UNIT                                             LOAD   ACTIVE SUB    DESCRIPTION                                            
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● kdump-early.service                              loaded failed failed Load kdump kernel early on startup
● kdump.service                                    loaded failed failed Load kdump kernel and initrd
● openqa_nvme_format.service                       loaded failed failed Setup NVMe before mounting it
● smartd.service                                   loaded failed failed Self Monitoring and Reporting Technology (SMART) Daemon

also cat /proc/cmdline does not mention crashkernel. Where is the parameter gone?

Actions #3

Updated by openqa_review 4 months ago

  • Due date set to 2024-08-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by okurz 4 months ago

  • Due date deleted (2024-08-19)
  • Status changed from In Progress to Resolved

The second storage device was missing. I did poweroff and then impitool … power on and the system booted just fine.

Now fdisk -l shows

Disk /dev/sda: 13.97 TiB, 15360413663232 bytes, 30000807936 sectors
Disk model: PERC H755 Adp

is back. All services good as well.

Conducted rollback steps and verified that there are no related alerts.

Actions #5

Updated by okurz about 2 months ago

  • Related to action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) added
Actions

Also available in: Atom PDF