action #159303

closed

[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S

Added by livdywan 28 days ago. Updated 9 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604

ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]
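
Only the last host in the list above carries an error detail; the other hosts are listed with no message under them. The following is a minimal sketch of how such output could be scanned to pick out the hosts that actually reported a problem. It assumes the layout shown above (a host name ending in a colon, followed by optional indented detail lines); the function name `minions_with_details` and the shortened `SAMPLE` text are illustrative and not part of osd-deployment.

```python
def minions_with_details(salt_output):
    """Return {host: [detail lines]} for hosts followed by indented details.

    Assumes each host appears on its own unindented line ending in ':' and
    that any message for that host is indented below it.
    """
    details = {}
    current = None
    for line in salt_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a detail belonging to the most recent host.
            if current is not None:
                details.setdefault(current, []).append(line.strip())
        elif line.rstrip().endswith(":"):
            # Unindented line ending in ':' starts a new host entry.
            current = line.rstrip()[:-1]
        else:
            # Any other unindented line (e.g. the ERROR banner) is not a host.
            current = None
    return details


# Shortened excerpt of the job log quoted in this ticket.
SAMPLE = """ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
diesel.qe.nue2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]
"""

failed = minions_with_details(SAMPLE)
for host, messages in failed.items():
    print(host, "->", "; ".join(messages))
```

With the excerpt above, only openqaworker-arm-1.qe.nue2.suse.org is reported, which matches the updated ticket subject.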

Suggestions

  • DONE Retry the pipeline, as the failure could be temporary. So far it has persisted for at least a couple of hours
  • Ensure that after the machine is up again in #159270 the deployment is retried and works fine
  • Wait for #157753 to have complete automation
  • This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can keep osd-deployment from getting stuck because of this one host, which is expected to be broken from time to time. Let's assume that our automatic recovery is good enough and always recovers the machine soon enough
  • Teach livdywan that it was just one machine failing, not all the listed ones
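
One conceivable answer to the question of how to keep osd-deployment from getting stuck on a single host that is expected to break occasionally is to separate tolerated hosts from blocking ones. The sketch below is purely illustrative and does not describe what osd-deployment does or what this ticket decided; the `EXPECTED_FLAKY` set and both function names are hypothetical.

```python
# Hypothetical set of hosts that are known to be unresponsive at times and
# are assumed to be covered by automatic recovery.
EXPECTED_FLAKY = {"openqaworker-arm-1.qe.nue2.suse.org"}


def blocking_failures(failed_hosts, tolerated=EXPECTED_FLAKY):
    """Return the failed hosts that are not on the tolerated list, sorted."""
    return sorted(set(failed_hosts) - set(tolerated))


def should_block_deployment(failed_hosts, tolerated=EXPECTED_FLAKY):
    """Block only if at least one non-tolerated host failed."""
    return bool(blocking_failures(failed_hosts, tolerated))


# Only the tolerated host is down: the deployment would proceed.
print(should_block_deployment(["openqaworker-arm-1.qe.nue2.suse.org"]))

# Another host is down as well: the deployment would be blocked.
print(should_block_deployment([
    "openqaworker-arm-1.qe.nue2.suse.org",
    "worker32.oqa.prg2.suse.org",
]))
```

Whether such a tolerated list is desirable at all depends on how reliably the automatic recovery from #157753 brings the machine back.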

Out of scope

  • Recovering openqaworker-arm-1: #159270

Related issues 2 (1 open, 1 closed)

Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M | Resolved | ybonatakis

Related to openQA Infrastructure - action #159270: openqaworker-arm-1 is Unreachable size:S | Feedback | ybonatakis | 2024-04-19 | 2024-05-28

Actions #1

Updated by okurz 28 days ago

  • Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Actions #2

Updated by okurz 28 days ago

  • Category set to Regressions/Crashes

This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can keep osd-deployment from getting stuck because of this one host, which is expected to be broken from time to time.

Actions #3

Updated by okurz 28 days ago

  • Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Actions #4

Updated by okurz 28 days ago

  • Subject changed from [alert] osd-deployment pre-deploy pipeline fails Minion did not return with many workers not responding to [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by nicksinger 27 days ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger

The recovery pipeline can be triggered and runs as expected: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513420
In Grafana, the corresponding "Contact point" for "Trigger reboot of openqaworker-arm-1" is still present but labeled as "Unused". The deployment was successful after we recovered arm-1 manually: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1088854

This has to wait until #157753 is done before we can continue.

Actions #6

Updated by nicksinger 27 days ago

  • Priority changed from High to Normal

Also, as the immediate problem was mitigated, I think we can lower the priority.

Actions #7

Updated by nicksinger 9 days ago

  • Status changed from Blocked to Resolved

As discussed in the unblock, we consider the main point of this ticket, validating that the automatic recovery works, to be covered. If we encounter conflicts between deployment and recovery in the future, we will create a new ticket.
