action #167164

closed

osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M

Added by livdywan 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125109

qesapworker-prg5.qa.suse.cz:
    Minion did not return. [Not connected]

https://stats.openqa-monitor.qa.suse.de/d/WDqesapworker-prg5/worker-dashboard-qesapworker-prg5?orgId=1&viewPanel=65105

Acceptance criteria

  • AC1: osd-deployment passed again
  • AC2: qesapworker-prg5.qa.suse.cz responsive and able to process jobs

Suggestions


Related issues: 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #166169: Failed systemd services on worker31 / osd size:M (Resolved, dheidler, 2024-07-09 to 2024-09-17)

Related to openQA Infrastructure (public) - action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S (Resolved, nicksinger, 2024-09-09 to 2024-09-24)

Related to openQA Infrastructure (public) - action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server (Resolved, okurz, 2024-09-28)

Copied from openQA Infrastructure (public) - action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) (Resolved, okurz, 2024-03-18)

Actions #1

Updated by livdywan 3 months ago

  • Copied from action #157441: osd-deployment | Failed pipeline for master (qesapworker-prg5.qa.suse.cz) added
Actions #2

Updated by okurz 3 months ago

  • Assignee changed from okurz to ybonatakis

As discussed, ybonatakis will follow up, potentially on-site from the PRG office as needed.

Actions #3

Updated by ybonatakis 3 months ago

  • Status changed from New to In Progress

Steps to bring it up:

  • ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <> -P *** power reset
  • I used the following commands to verify that the machine is reachable (see the sketch after this list):
  • iob@openqa:~> sudo salt qesapworker-prg5.qa.suse.cz test.ping
  • ping -c2 qesapworker-prg5.qa.suse.cz
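
A minimal sketch of the same recovery sequence as a small script, assuming the host names above and placeholder credentials; the salt check runs on the salt master (OSD):

# power-cycle via the management interface, then wait until the host answers again
ipmitool -I lanplus -H qesapworker-prg5-mgmt.qa.suse.cz -U <user> -P <password> power reset
until ping -c1 -W2 qesapworker-prg5.qa.suse.cz >/dev/null; do sleep 10; done
sudo salt qesapworker-prg5.qa.suse.cz test.ping   # run on the salt master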

Journal logs from 22/09
https://gist.github.com/b10n1k/0e584da3b3a6d73e9d0dae42f5eee09b

CI retriggered and passed: https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3125107

Actions #4

Updated by ybonatakis 3 months ago · Edited

https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1323448 ran and everything looks passing, but the qesapworker-prg5 workers are still offline.

iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl status openqa-worker-auto-restart@*.service'
iob@openqa:~> sudo salt -l debug qesapworker-prg5.qa.suse.cz cmd.run 'systemctl restart openqa-worker-auto-restart@*.service'

The output doesn't say much, but it contains something like [DEBUG ] return event: {'qesapworker-prg5.qa.suse.cz': {'ret': '', 'retcode': 0, 'jid': '20240923120727551566'}}
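
If needed, what such a salt job actually returned can be looked up again on the master by its jid; a small sketch using the jid from the debug line above:

sudo salt-run jobs.lookup_jid 20240923120727551566   # show the per-minion return for this job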

Digging in further I found:

iob@qesapworker-prg5:~> sudo journalctl -u openqa-worker-auto-restart@1.service --since="today"
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 09:31:15 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 10:33:05 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: Dependency failed for openQA Worker #1.
Sep 23 11:33:02 qesapworker-prg5 systemd[1]: openqa-worker-auto-restart@1.service: Job openqa-worker-auto-restart@1.service/start failed with result 'dependency'.
iob@qesapworker-prg5:~> sudo systemctl list-units --failed
  UNIT                                             LOAD   ACTIVE SUB    DESCRIPTION                                            
● automount-restarter@var-lib-openqa-share.service loaded failed failed Restarts the automount unit var-lib-openqa-share
● openqa_nvme_format.service                       loaded failed failed Setup NVMe before mounting it
● smartd.service                                   loaded failed failed Self Monitoring and Reporting Technology (SMART) Daemon

iob@qesapworker-prg5:~> sudo systemctl status openqa_nvme_format.service
× openqa_nvme_format.service - Setup NVMe before mounting it
     Loaded: loaded (/etc/systemd/system/openqa_nvme_format.service; disabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-09-23 12:13:36 UTC; 16min ago
    Process: 39448 ExecStart=/usr/local/bin/openqa-establish-nvme-setup (code=exited, status=1/FAILURE)
   Main PID: 39448 (code=exited, status=1/FAILURE)

Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /home
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /.snapshots
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: │                                /
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39451]: └─sda3   8:3    0     1G  0 part [SWAP]
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Creating RAID0 "/dev/md/openqa" on: /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39458]: mdadm: cannot open /dev/disk/by-id/scsi-SDELL_PERC_H755_Adp_00e7176dba09d4532c00f9c13280e04e: No such file or directory
Sep 23 12:13:36 qesapworker-prg5 openqa-establish-nvme-setup[39448]: Unable to create RAID, mdadm returned with non-zero code
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: openqa_nvme_format.service: Failed with result 'exit-code'.
Sep 23 12:13:36 qesapworker-prg5 systemd[1]: Failed to start Setup NVMe before mounting it.
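
So the unit fails because the PERC-backed block device it expects is missing entirely. A minimal follow-up sketch for once the device shows up again (generic commands, not taken from this ticket; run on the worker):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT            # confirm the second storage device is visible at all
ls -l /dev/disk/by-id/ | grep PERC            # the unit references the PERC H755 device by id
sudo systemctl restart openqa_nvme_format.service            # re-run the RAID/mount setup
sudo systemctl restart openqa-worker-auto-restart@1.service  # then the worker instances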

iob@qesapworker-prg5:~> sudo systemctl status automount-restarter@var-lib-openqa-share.service
× automount-restarter@var-lib-openqa-share.service - Restarts the automount unit var-lib-openqa-share
     Loaded: loaded (/etc/systemd/system/automount-restarter@.service; static)
     Active: failed (Result: exit-code) since Mon 2024-09-23 09:31:26 UTC; 3h 0min ago
   Main PID: 3390 (code=exited, status=1/FAILURE)

Sep 23 09:31:16 qesapworker-prg5 systemd[1]: Starting Restarts the automount unit var-lib-openqa-share...
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
iob@qesapworker-prg5:~> sudo journalctl -xeu automount-restarter@var-lib-openqa-share.service
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has begun execution.
░░ 
░░ The job identifier is 421.
Sep 23 09:31:26 qesapworker-prg5 bash[3390]: A dependency job for var-lib-openqa-share.automount failed. See 'journalctl -xe' for details.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ An ExecStart= process belonging to unit automount-restarter@var-lib-openqa-share.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: automount-restarter@var-lib-openqa-share.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit automount-restarter@var-lib-openqa-share.service has entered the 'failed' state with result 'exit-code'.
Sep 23 09:31:26 qesapworker-prg5 systemd[1]: Failed to start Restarts the automount unit var-lib-openqa-share.
░░ Subject: A start job for unit automount-restarter@var-lib-openqa-share.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit automount-restarter@var-lib-openqa-share.service has finished with a failure.
░░ 
░░ The job identifier is 421 and the job result is failed.
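
To trace why the automount's dependency failed, generic systemd commands like the following can help; this is only a sketch, with the failed openqa_nvme_format.service above being the likely root of the chain:

systemctl list-dependencies var-lib-openqa-share.automount   # which units the automount pulls in
systemctl status var-lib-openqa-share.automount var-lib-openqa-share.mount
journalctl -b -u var-lib-openqa-share.automount              # dependency failure messages from this boot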

Actions #5

Updated by ybonatakis 3 months ago

  • Related to action #166169: Failed systemd services on worker31 / osd size:M added
Actions #6

Updated by ybonatakis 3 months ago · Edited

I linked #166169 which, as pointed out by @nicksinger, seems relevant.

Actions #7

Updated by livdywan 3 months ago

Did you consider taking qesapworker-prg5.qa.suse.cz out of salt for now? As long as units are failed, pipelines will continue to fail.

Actions #8

Updated by openqa_review 3 months ago

  • Due date set to 2024-10-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by ybonatakis 3 months ago

livdywan wrote in #note-7:

Did you consider taking qesapworker-prg5.qa.suse.cz out of salt for now? As long as units are failed, pipelines will continue to fail.

I did, but not because of the pipelines; the pipelines at https://gitlab.suse.de/openqa/osd-deployment/-/pipelines don't seem to be failing.

Actions #11

Updated by okurz 3 months ago

  • Subject changed from osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) to osd-deployment | Minions returned with non-zero exit code (qesapworker-prg5.qa.suse.cz) size:M
Actions #12

Updated by okurz 3 months ago

  • Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-office-day, next-prague-office-visit

I checked lsblk on qesapworker-prg5 and found, same as in previous tickets, that the second storage device is missing. This now seems to have happened multiple times. As discussed, please physically ensure proper seating of the storage devices, i.e. pull them out and plug them back in.
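
A quick remote check from the salt master whether the device is back after reseating, as a sketch:

sudo salt qesapworker-prg5.qa.suse.cz cmd.run 'lsblk -o NAME,SIZE,TYPE'   # the second device should be listed again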

Actions #13

Updated by ybonatakis 3 months ago

I have no idea if someone else did something before me.
I followed the suggestion to reboot the iDRAC and then I had to turn the system on.

In the meantime I was trying to find out how to update the firmware (iDRAC), but I didn't have that option. I logged in with the credentials from workerconf.sls. While trying to find something regarding this I navigated to the storage section, where I saw

I will leave it at that and check again on Monday.

Actions #14

Updated by okurz 3 months ago

  • Related to action #166520: [alert][FIRING:1] qesapworker-prg5 (qesapworker-prg5: host up alert host_up openQA host_up_alert_qesapworker-prg5 worker) size:S added
Actions #15

Updated by okurz 3 months ago

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.
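
For reference, a rough sketch of what taking a host out of salt-controlled production amounts to on the master; the linked wiki page is authoritative and the exact steps may differ:

sudo salt-key -y -d qesapworker-prg5.qa.suse.cz   # on OSD: drop the minion key so salt stops managing the host
# to bring it back later, restart salt-minion on the host and re-accept the key:
# sudo salt-key -y -a qesapworker-prg5.qa.suse.cz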

Actions #16

Updated by okurz 3 months ago

  • Related to action #167557: OSD not starting new jobs on 2024-09-28 due to >1k worker instances connected, overloading websocket server added
Actions #17

Updated by ybonatakis 3 months ago

okurz wrote in #note-15:

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.

I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?

Actions #18

Updated by okurz 3 months ago

ybonatakis wrote in #note-17:

okurz wrote in #note-15:

For #167557 I removed qesapworker-prg5 from production per https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production as it's not fully usable right now anyway.

I was thinking of making them available again when I saw your comment. Can I try? Or what do you expect as the next step?

Well, as the same problem seems to have happened multiple times, I still suggest you re-add the tags you removed in #167164-13 and physically ensure proper seating of the storage device connectors.

Actions #19

Updated by ybonatakis 3 months ago

  • Tags changed from infra, reactive work, alert to infra, reactive work, alert, next-prague-office-visit

Just re-added the next-prague-office-visit tag, which was incidentally removed on Friday.

Actions #20

Updated by okurz 3 months ago

  • Priority changed from High to Normal
Actions #21

Updated by nicksinger 3 months ago

I've updated the following components:

  • iDRAC from 7.00.(something) to 7.10.70.00
  • SAS RAID Firmware from 52.16.1-4405 to 52.26.0-5179
  • BIOS from 1.10.2 to 1.15.2

I also cleared the iDRAC logs so you have a clean starting point to validate if the machine still has problems detecting its disks.

Actions #22

Updated by ybonatakis 3 months ago · Edited

I just turned on the system after @nicksinger's updates. The disk showed up and everything looks normal.

Actions #23

Updated by ybonatakis 3 months ago

Actions #24

Updated by ybonatakis 3 months ago

  • Status changed from Feedback to Resolved

Oli said that the worker class is disabled on most of the workers, but that is related to another ticket (poo#??). Resolving as per the last actions from Nick and since the machine is running stably in OSD. Disks are visible and we move on to the monitoring stage.

Actions #25

Updated by okurz 3 months ago

  • Due date deleted (2024-10-08)