action #157441

closed

osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz)

Added by tinita about 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227

Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000

qesapworker-prg5.qa.suse.cz:
2184    Minion did not return. [Not connected]

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1

The worker seemed to have hung: there was no login prompt on the serial tty.
I rebooted it via IPMI.
The worker came up, but a systemd service failed: …

It seems the NVMe disk is no longer found. Maybe it died and the system subsequently froze.

Acceptance criteria

  • AC1: osd-deployment passes again
  • AC2: qesapworker-prg5.qa.suse.cz is back in production

Suggestions

Rollback steps


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #157453: [FIRING:1] host_up (qesapworker-prg5: host up alert openQA qesapworker-prg5 host_up_alert_qesapworker-prg5 worker), Rejected, okurz, 2024-03-18

Actions #1

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #2

Updated by tinita about 2 months ago

  • Priority changed from Normal to High
Actions #3

Updated by okurz about 2 months ago

  • Tags set to infra, reactive work
  • Assignee set to okurz
  • Priority changed from High to Urgent
Actions #4

Updated by okurz about 2 months ago

  • Related to action #157453: [FIRING:1] host_up (qesapworker-prg5: host up alert openQA qesapworker-prg5 host_up_alert_qesapworker-prg5 worker) added
Actions #5

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #6

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #7

Updated by okurz about 2 months ago

# systemctl status openqa_nvme_format.service
. openqa_nvme_format.service - Setup NVMe before mounting it
     Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-03-18 11:01:18 CET; 3min 30s ago
    Process: 31734 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
   Main PID: 31734 (code=exited, status=1/FAILURE)

Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /boot/grub2/i386-pc
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /.snapshots
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: └─sda3   8:3    0     1G  0 part [SWAP]
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31748]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Unable to create RAID, mdadm returned with non-zero code
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
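
The mdadm error in the log means the by-id device node did not exist when the service ran. A quick way to confirm whether the device is visible again after a reboot or power cycle could look like the sketch below. The helper name device_state is hypothetical and not part of openqa-establish-nvme-setup; the device path is copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper, not part of openqa-establish-nvme-setup.
# Prints "present" if the given path resolves to a block device,
# "missing" otherwise (symlinks under /dev/disk/by-id are followed).
device_state() {
    if [ -b "$1" ]; then
        echo present
    else
        echo missing
    fi
}

# Device path copied verbatim from the journal output above.
dev=/dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
echo "$dev: $(device_state "$dev")"
```

If this reports "missing", restarting openqa_nvme_format.service would presumably fail with the same mdadm error, so the check is only worth running before retrying the service.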
Actions #8

Updated by okurz about 2 months ago

  • Status changed from New to In Progress
Actions #9

Updated by okurz about 2 months ago

  • Description updated (diff)
Actions #10

Updated by okurz about 2 months ago

  • Description updated (diff)
  • Status changed from In Progress to Resolved

I removed the machine from OSD salt keys and retriggered the OSD pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545

On qesapworker-prg5 I triggered a reboot. To check whether a complete power cycle would help, I ran ipmitool … power cycle and ipmitool … sol activate and saw a complete, successful reboot. The power cycle seems to have helped, as the device is back. I am putting the machine back into production. Meanwhile https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545 is good again: the salt state was applied cleanly, the openQA workers are running fine again, and the alerts are back to good as well.
