action #159303

open

[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S

Added by livdywan 13 days ago. Updated 12 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604

ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]
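
For reference, a minimal sketch (not part of the ticket's tooling) of how such unreachable minions can be listed directly on the salt master, assuming the salt Python bindings are installed there; "Minion did not return. [Not connected]" simply means the minion never answered within the timeout:

    # Sketch only: list accepted minions that currently do not respond,
    # using salt's manage.down runner. Assumes it runs on the salt master
    # with the stock /etc/salt/master config path.
    import salt.config
    import salt.runner

    opts = salt.config.master_config("/etc/salt/master")
    runner = salt.runner.RunnerClient(opts)

    # manage.down returns the accepted minions that failed to answer a ping
    for minion in runner.cmd("manage.down", []):
        print(f"unreachable: {minion}")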

Suggestions

  • DONE: Retry the pipeline in case the failure is temporary; so far the failure has persisted for at least a couple of hours
  • Ensure that once the machine is up again (#159270) the deployment is retried and succeeds
  • Wait for #157753 to have complete automation
  • This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can avoid the osd-deployment being held up by this one host, which is expected to be broken from time to time (see the sketch after this list). Let's assume that our automatic recovery is good enough and always brings the machine back soon enough
  • Teach livdywan that it was just one machine failing, not all the listed ones
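
As referenced in the suggestion above, here is a sketch of how a single known-flaky host could be tolerated without masking new outages. It assumes the pre-deploy check boils down to pinging all salt minions; this is an illustration only, not the actual osd-deployment code:

    # Sketch only: fail the pre-deploy check on unexpected unreachable
    # minions, but merely warn about hosts that are known to be flaky and
    # covered by automatic recovery (#157753). Not the real osd-deployment
    # logic; assumes salt Python bindings on the master.
    import sys

    import salt.config
    import salt.runner

    EXPECTED_FLAKY = {"openqaworker-arm-1.qe.nue2.suse.org"}

    opts = salt.config.master_config("/etc/salt/master")
    down = set(salt.runner.RunnerClient(opts).cmd("manage.down", []))

    fatal = sorted(down - EXPECTED_FLAKY)
    if fatal:
        print("ERROR: unreachable minions: " + ", ".join(fatal))
        sys.exit(1)
    if down:
        print("WARNING: ignoring known-flaky hosts: " + ", ".join(sorted(down)))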

Out of scope

  • Recovering openqaworker-arm-1: #159270

Related issues 2 (0 open, 2 closed)

Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)

Related to openQA Infrastructure - action #159270: openqaworker-arm-1 is Unreachable size:S (Resolved, ybonatakis, 2024-04-19)

Actions #1

Updated by okurz 13 days ago

  • Related to action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M added
Actions #2

Updated by okurz 13 days ago

  • Category set to Regressions/Crashes

This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can avoid the osd-deployment being held up by this one host, which is expected to be broken from time to time.

Actions #3

Updated by okurz 13 days ago

  • Related to action #159270: openqaworker-arm-1 is Unreachable size:S added
Actions #4

Updated by okurz 13 days ago

  • Subject changed from [alert] osd-deployment pre-deploy pipeline fails Minion did not return with many workers not responding to [alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by nicksinger 12 days ago

  • Status changed from Workable to Blocked
  • Assignee set to nicksinger

The recovery pipeline can be triggered and runs as expected: https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/2513420
In Grafana the corresponding "Contact point" for "Trigger reboot of openqaworker-arm-1" is still present but labeled as "Unused". The deployment was successful after we recovered arm-1 manually: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/1088854

Basically this has to wait until #157753 is done before we can continue.
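
For illustration of the trigger mechanism mentioned above, a sketch of how such a recovery pipeline can be triggered by hand through the GitLab pipeline trigger API, i.e. the same mechanism a Grafana webhook contact point ends up calling. The project ID, token environment variable, and the HOST pipeline variable below are placeholders, not the real grafana-webhook-actions setup:

    # Sketch only: manually trigger a GitLab pipeline via the trigger API.
    # Project ID, the token env var, and the HOST variable are hypothetical.
    import os

    import requests

    resp = requests.post(
        "https://gitlab.suse.de/api/v4/projects/"
        + os.environ["RECOVERY_PROJECT_ID"]  # placeholder project ID
        + "/trigger/pipeline",
        data={
            "token": os.environ["GITLAB_TRIGGER_TOKEN"],  # placeholder token
            "ref": "master",
            # hypothetical variable telling the pipeline which host to recover
            "variables[HOST]": "openqaworker-arm-1.qe.nue2.suse.org",
        },
        timeout=30,
    )
    resp.raise_for_status()
    print("triggered:", resp.json().get("web_url"))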

Actions #6

Updated by nicksinger 12 days ago

  • Priority changed from High to Normal

Also, since the immediate problem was mitigated, I think we can lower the priority.
