action #167164

closed

osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M

Added by livdywan 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125109

qesapworker-prg5.qa.suse.cz:
    Minion did not return. [Not connected]

https://stats.openqa-monitor.qa.suse.de/d/WDqesapworker-prg5/worker-dashboard-qesapworker-prg5?orgId=1&viewPanel=65105

Acceptance criteria

  • AC1: osd-deployment passed again
  • AC2: qesapworker-prg5.qa.suse.cz responsive and able to process jobs

Suggestions


Related issues: 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #166169: Failed systemd services on worker31 / osd size:M (Resolved, dheidler, 2024-07-09 to 2024-09-17)

Related to openQA Infrastructure (public) - action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S (Resolved, nicksinger, 2024-09-09 to 2024-09-24)

Related to openQA Infrastructure (public) - action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server (Resolved, okurz, 2024-09-28)

Copied from openQA Infrastructure (public) - action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) (Resolved, okurz, 2024-03-18)

Actions #1

Updated by livdywan 3 months ago

  • Copied from action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) added
Actions #2

Updated by okurz 3 months ago

  • Assignee changed from okurz to ybonatakis

As discussed, ybonatakis will follow up, potentially on-site from the PRG office as needed.

Actions #3

Updated by ybonatakis 3 months ago

  • Status changed from New to In Progress

Steps to bring it up:

  • ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <> -P *** power reset
  • I used the following commands to verify that the machine is reachable (see the sketch after this list):
  • iob@openqa:~> sudo salt qesapworker-prg5.qa.suse.cz test.ping
  • ping -c2 qesapworker-prg5.qa.suse.cz
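
A minimal sketch of the same recovery sequence as a small script, assuming the host names above and placeholder credentials; the salt check runs on the salt master (OSD):

# power-cycle via the management interface, then wait until the host answers again
ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <user> -P <password> power reset
until ping -c1 -W2 qesapworker-prg5.qa.suse.cz >/dev/null; do sleep 10; done
sudo salt qesapworker-prg5.qa.suse.cz test.ping   # run on the salt master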

Journal logs from 22/09
https://gist.github.com/b10n1k/0e584da3b3a6d73e9d0dae42f5eee09b

CI retriggered and passed: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125107

Actions #4

Updated by ybonatakis 3 months ago · Edited

https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1323448 ran and everything looks passing, but the qesapworker-prg5 workers are still offline.

iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl status openqa-worker-auto-restart@*.service'
iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl restart openqa-worker-auto-restart@*.service'

The output doesn't say much, but it contains something like [DEBUG ] return event: {'qesapworker-prg5.qa.suse.cz': {'ret': '', 'retcode': 0, 'jid': '20240923120727551566'}}
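
If needed, what such a salt job actually returned can be looked up again on the master by its jid; a small sketch using the jid from the debug line above:

sudo salt-run jobs.lookup_jid 20240923120727551566   # show the per-minion return for this job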

Digging in further I found:

iob@qesapworker-prg5:~> sudo journalctl -u openqa-worker-auto-restart@1.service --since="today"
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
iob@qesapworker-prg5:~> sudo systemctl list-units --failed
  UNIT                                             LOAD   ACTIVE SUB    DESCRIPTION                                            
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● openqa_nvme_format.service                       loaded failed failed Setup NVMe before mounting it
● smartd.service                                   loaded failed failed Self Monitoring and Reporting Technology (SMART) Daemon

iob@qesapworker-prg5:~> sudo systemctl status openqa_nvme_format.service
× openqa_nvme_format.service - Setup NVMe before mounting it
     Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-09-23 12:13:36 UTC; 16min ago
    Process: 39448 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
   Main PID: 39448 (code=exited, status=1/FAILURE)

Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /home
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /.snapshots
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: └─sda3   8:3    0     1G  0 part [SWAP]
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39458]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Unable to create RAID, mdadm returned with non-zero code
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
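
So the unit fails because the PERC-backed block device it expects is missing entirely. A minimal follow-up sketch for once the device shows up again (generic commands, not taken from this ticket; run on the worker):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT            # confirm the second storage device is visible at all
ls -l /dev/disk/by-id/ | grep PERC            # the unit references the PERC H755 device by id
sudo systemctl restart openqa_nvme_format.service            # re-run the RAID/mount setup
sudo systemctl restart openqa-worker-auto-restart@1.service  # then the worker instances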

iob@qesapworker-prg5:~> sudo systemctl status automount-restarter@var-lib-openqa-share.service
× automount-restarter@var-lib-openqa-share.service - Restarts the automount unit var-lib-openqa-share
     Loaded: loaded (/etc/systemd/system/automount-restarter@.service; static)
     Active: failed (Result: exit-code) since Mon 2024-09-23 09:31:26 UTC; 3h 0min ago
   Main PID: 3390 (code=exited, status=1/FAILURE)

Sep 23 09:31:16 qesapworker-prg5 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
iob@qesapworker-prg5:~> sudo journalctl -xeu automount-restarter@var-lib-openqa-share.service
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has begun execution.
░░ 
░░ The job identifier is 421.
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit automount-restarter@var-lib-openqa-share.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit automount-restarter@var-lib-openqa-share.service has entered the 'failed' state with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
░░ Subject: A start job for unit automount-restarter@var-lib-openqa-share.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has finished with a failure.
░░ 
░░ The job identifier is 421 and the job result is failed.
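
To trace why the automount's dependency failed, generic systemd commands like the following can help; this is only a sketch, with the failed openqa_nvme_format.service above being the likely root of the chain:

systemctl list-dependencies var-lib-openqa-share.automount   # which units the automount pulls in
systemctl status var-lib-openqa-share.automount var-lib-openqa-share.mount
journalctl -b -u var-lib-openqa-share.automount              # dependency failure messages from this boot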

Actions #5

Updated by ybonatakis 3 months ago

  • Related to action #166169: Failed systemd services on worker31 / osd size:M added
Actions #6

Updated by ybonatakis 3 months ago · Edited

I linked #166169 which, as pointed out by @nicksinger, seems relevant.

Actions #7

Updated by livdywan 3 months ago

Did you consider taking qesapworker-prg5.qa.suse.cz out of salt for now? As long as units are failed, pipelines will continue to fail.

Actions #8

Updated by openqa_review 3 months ago

  • Due date set to 2024-10-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis 3 months ago

livdywan wrote in #note-7:

Did you consider taking qesapworker-prg5.qa.suse.cz out of salt for now? As long as units are failed, pipelines will continue to fail.

I did, but not because of the pipelines; the pipelines at https://gitlab.suse.de/openqa/osd-deployment/-/pipelines don't seem to be failing.

Actions #11

Updated by okurz 3 months ago

  • Subject changed from osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) to osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M
Actions #12

Updated by okurz 3 months ago

  • Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-office-day, next-prague-office-visit

I checked lsblk on qesapworker-prg5 and found, same as in previous tickets, that the second storage device is missing. This now seems to have happened multiple times. As discussed, please physically ensure proper seating of the storage devices, i.e. pull them out and plug them back in.
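
A quick remote check from the salt master whether the device is back after reseating, as a sketch:

sudo salt qesapworker-prg5.qa.suse.cz cmd.run 'lsblk -o NAME,SIZE,TYPE'   # the second device should be listed again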

Actions #13

Updated by ybonatakis 3 months ago

I have no idea if someone else did something before me.
I followed the suggestion to reboot the iDRAC and then I had to turn the system on.

In the meantime I was trying to find out how to update the firmware (iDRAC), but I didn't have that option. I logged in with the credentials from workerconf.sls. While trying to find something regarding this I navigated to the storage section, where I saw

I will leave it at that and check again on Monday.

Actions #14

Updated by okurz 3 months ago

  • Related to action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S added
Actions #15

Updated by okurz 3 months ago

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.
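
For reference, a rough sketch of what taking a host out of salt-controlled production amounts to on the master; the linked wiki page is authoritative and the exact steps may differ:

sudo salt-key -y -d qesapworker-prg5.qa.suse.cz   # on OSD: drop the minion key so salt stops managing the host
# to bring it back later, restart salt-minion on the host and re-accept the key:
# sudo salt-key -y -a qesapworker-prg5.qa.suse.cz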

Actions #16

Updated by okurz 3 months ago

  • Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added
Actions #17

Updated by ybonatakis 3 months ago

okurz wrote in #note-15:

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.

I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?

Actions #18

Updated by okurz 3 months ago

ybonatakis wrote in #note-17:

okurz wrote in #note-15:

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.

I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?

Well, as the same problem seems to have happened multiple times, I still suggest you re-add the tags you removed in #167164-13 and physically ensure proper seating of the storage device connectors.

Actions #19

Updated by ybonatakis 3 months ago

  • Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-prague-office-visit

Just re-added the next-prague-office-visit tag, which was incidentally removed on Friday.

Actions #20

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #21

Updated by nicksinger 3 months ago

I've updated the following components:

  • iDRAC from 7.00.(something) to 7.10.70.00
  • SAS RAID Firmware from 52.16.1-4405 to 52.26.0-5179
  • BIOS from 1.10.2 to 1.15.2

I also cleared the iDRAC logs so you have a clean starting point to validate if the machine still has problems detecting its disks.

Actions #22

Updated by ybonatakis 3 months ago · Edited

I just turned on the system after @nicksinger's updates. The disk showed up and everything looks normal.

Actions #23

Updated by ybonatakis 3 months ago

Actions #24

Updated by ybonatakis 3 months ago

  • Status changed from Feedback to Resolved

Oli said that the worker class is disabled on most of the workers, but that is related to another ticket (poo#??). Resolving as per the last actions from Nick and since the machine is running stably in OSD. Disks are visible and we move on to the monitoring stage.

Actions #25

Updated by okurz 3 months ago

  • Due date deleted (2024-10-08)