action #157441

closed

osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz)

Added by tinita 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2398227

Date: Sun, 17 Mar 2024 05:49:14 +0000
Date: Mon, 18 Mar 2024 05:49:42 +0000

qesapworker-prg5.qa.suse.cz:
2184    Minion did not return. [Not connected]

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_qesapworker-prg5/view?orgId=1

The worker appeared to have hung: there was no login prompt on the serial tty.
Rebooted via IPMI.
Worker came up, but a systemd service failed: …

It seems the NVMe disk is no longer found. Maybe it died and the system subsequently froze.

Acceptance criteria

  • AC1: osd-deployment passes again
  • AC2: qesapworker-prg5.qa.suse.cz is back in production

Suggestions

Rollback steps


Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #157453: [FIRING:1] host_up (qesapworker-prg5: host up alert openQA qesapworker-prg5 host_up_alert_qesapworker-prg5 worker) [Rejected, okurz, 2024-03-18]

Related to openQA Infrastructure (public) - action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S [Resolved, nicksinger, 2024-09-09, 2024-09-24]

Related to openQA Infrastructure (public) - action #164907: [alert][FIRING:1] host_up (qesapworker-prg5: host up alert openQA, qesapworker-prg5-mgmt.qa.suse.cz not reachable, failing osd-deployment [Resolved, okurz, 2024-08-04]

Copied to openQA Infrastructure (public) - action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M [Resolved, ybonatakis]

Actions #1

Updated by tinita 9 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 9 months ago

  • Priority changed from Normal to High
Actions #3

Updated by okurz 9 months ago

  • Tags set to infra, reactive work
  • Assignee set to okurz
  • Priority changed from High to Urgent
Actions #4

Updated by okurz 9 months ago

  • Related to action #157453: [FIRING:1] host_up (qesapworker-prg5: host up alert openQA qesapworker-prg5 host_up_alert_qesapworker-prg5 worker) added
Actions #5

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #6

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #7

Updated by okurz 9 months ago

# systemctl status openqa_nvme_format.service
. openqa_nvme_format.service - Setup NVMe before mounting it
     Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-03-18 11:01:18 CET; 3min 30s ago
    Process: 31734 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
   Main PID: 31734 (code=exited, status=1/FAILURE)

Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /boot/grub2/i386-pc
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /.snapshots
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: │                                /
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31739]: └─sda3   8:3    0     1G  0 part [SWAP]
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31748]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Mar 18 11:01:17 qesapworker-prg5 openqa-establish-nvme-setup[31734]: Unable to create RAID, mdadm returned with non-zero code
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Mar 18 11:01:18 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
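
The log above shows mdadm being called on a /dev/disk/by-id path that no longer exists. As a rough sketch only, a pre-check along the following lines could report a missing device before RAID creation is attempted. The function name missing_devices is hypothetical, and this is not the actual logic of openqa-establish-nvme-setup, which is not shown in this ticket.

```shell
#!/bin/sh
# Hypothetical helper, not part of openqa-establish-nvme-setup:
# report every given device path that does not exist and
# return non-zero if at least one is missing.
missing_devices() {
    rc=0
    for dev in "$@"; do
        # -e follows symlinks, so a dangling /dev/disk/by-id link counts as missing
        if [ ! -e "$dev" ]; then
            printf 'missing device: %s\n' "$dev" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A caller could then stop early, for example with `missing_devices /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e || exit 1`, which would make the failure reason visible without relying on the mdadm error message.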
Actions #8

Updated by okurz 9 months ago

  • Status changed from New to In Progress
Actions #9

Updated by okurz 9 months ago

  • Description updated (diff)
Actions #10

Updated by okurz 9 months ago

  • Description updated (diff)
  • Status changed from In Progress to Resolved

I removed the machine from OSD salt keys and retriggered the OSD pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545

On qesapworker-prg5 I triggered a reboot. To check whether a complete power cycle would help, I ran ipmitool … power cycle followed by ipmitool … sol activate and watched a complete, successful reboot. The power cycle seems to have helped, as the device is back. I am putting the machine back into production. Meanwhile https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1043545 is good again: the salt state was applied cleanly, openQA workers are running fine again, and the alerts are back to good as well.
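
The power cycle and serial console steps can be sketched as a small shell wrapper. This is an illustration under assumptions, not the exact commands used here: the ipmitool arguments are elided in the comment above, so the lanplus interface, the user name and the function names ipmi and power_cycle_and_watch are all hypothetical. The wrapper only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical wrapper: build the ipmitool command line for a BMC host.
# Assumptions (not from the ticket): lanplus interface, user name argument.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
# With DRY_RUN unset or set to 1 the command is only printed, not executed.
ipmi() {
    host=$1; user=$2; shift 2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "ipmitool -I lanplus -H $host -U $user -E $*"
    else
        ipmitool -I lanplus -H "$host" -U "$user" -E "$@"
    fi
}

# Power-cycle the machine, then attach to the serial-over-LAN console
# to watch the boot.
power_cycle_and_watch() {
    ipmi "$1" "$2" chassis power cycle &&
    ipmi "$1" "$2" sol activate
}
```

For example, `power_cycle_and_watch qesapworker-prg5-mgmt.qa.suse.cz someuser` would print the two commands for review, where someuser is a placeholder. Keeping the dry run as the default is a deliberate choice, since a power cycle on the wrong host is disruptive.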

Actions #11

Updated by livdywan 3 months ago

  • Copied to action #167164: osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M added
Actions #12

Updated by okurz 3 months ago

  • Related to action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S added
Actions #13

Updated by okurz 3 months ago

  • Related to action #164907: [alert][FIRING:1] host_up (qesapworker-prg5: host up alert openQA, qesapworker-prg5-mgmt.qa.suse.cz not reachable, failing osd-deployment added