action #159303

closed

[alert] osd-deployment pre-deploy pipeline failed because openqaworker-arm-1.qe.nue2.suse.org was offline size:S

Added by livdywan 28 days ago. Updated 9 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

See https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2510604

ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
openqaworker18.qa.suse.cz:
worker31.oqa.prg2.suse.org:
worker35.oqa.prg2.suse.org:
worker33.oqa.prg2.suse.org:
worker29.oqa.prg2.suse.org:
worker30.oqa.prg2.suse.org:
worker40.oqa.prg2.suse.org:
worker34.oqa.prg2.suse.org:
openqaworker16.qa.suse.cz:
openqaworker17.qa.suse.cz:
worker-arm2.oqa.prg2.suse.org:
worker-arm1.oqa.prg2.suse.org:
qesapworker-prg4.qa.suse.cz:
qesapworker-prg6.qa.suse.cz:
qesapworker-prg5.qa.suse.cz:
qesapworker-prg7.qa.suse.cz:
sapworker2.qe.nue2.suse.org:
sapworker3.qe.nue2.suse.org:
sapworker1.qe.nue2.suse.org:
openqaworker14.qa.suse.cz:
mania.qe.nue2.suse.org:
petrol.qe.nue2.suse.org:
openqaworker1.qe.nue2.suse.org:
imagetester.qe.nue2.suse.org:
diesel.qe.nue2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]

Suggestions

  • DONE Retry the pipeline, as the failure could be temporary. So far it has persisted for at least a couple of hours
  • Once the machine is up again (#159270), ensure that the deployment is retried and succeeds
  • Wait for #157753 to have complete automation
  • This is related to #157753. openqaworker-arm-1 is expected to be unresponsive at times, but automatic recovery should already handle that. Please check whether the automatic recovery from https://gitlab.suse.de/openqa/grafana-webhook-actions is effective and how we can avoid osd-deployment being blocked by this one host, which is expected to break from time to time. Let's assume that our automatic recovery is good enough and always recovers the machine soon enough
  • Teach livdywan that it was just one machine failing, not all the listed ones
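
To illustrate the last point: in the output quoted in the description, only the host followed by an indented message actually reported a problem. The following is a minimal sketch, not part of the osd-deployment pipeline, that picks out such hosts from that kind of output. The function name `failed_minions` and the reading that a host header without indented detail lines had nothing to report are assumptions based on this ticket alone.

```python
def failed_minions(output: str) -> dict[str, list[str]]:
    """Map each host that has indented detail lines to those lines.

    Assumption: a line ending in ':' with no leading whitespace is a host
    header, and indented lines directly below it are that host's messages.
    Hosts with no indented lines are treated as having nothing to report.
    """
    failures: dict[str, list[str]] = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: detail belonging to the most recent host header.
            if current is not None:
                failures.setdefault(current, []).append(line.strip())
        elif line.rstrip().endswith(":"):
            current = line.rstrip()[:-1]  # new host header
        else:
            current = None  # e.g. the leading "ERROR: ..." summary line
    return failures


sample = """ERROR: Minions returned with non-zero exit code
worker32.oqa.prg2.suse.org:
grenache-1.oqa.prg2.suse.org:
openqaworker-arm-1.qe.nue2.suse.org:
    Minion did not return. [Not connected]
"""

print(failed_minions(sample))
```

Run on the shortened sample above, this reports only openqaworker-arm-1.qe.nue2.suse.org together with its "Minion did not return. [Not connected]" message.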

Out of scope

  • Recovering openqaworker-arm-1: #159270

Related issues 2 (0 open, 2 closed)

Related to QA - action #157753: Bring back automatic recovery for openqaworker-arm-1 size:M (Resolved, ybonatakis)
Related to openQA Infrastructure - action #159270: openqaworker-arm-1 is Unreachable size:S (Resolved, ybonatakis, 2024-04-19)
