action #157726
openQA Project (public) - coordination #110833 (closed): [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org)
Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-03-18
Due date:
% Done: 0%
Estimated time:
Tags:
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/2415705
worker37.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker36.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker38.oqa.prg2.suse.org:
Minion did not return. [Not connected]
worker39.oqa.prg2.suse.org:
Minion did not return. [Not connected]
Acceptance criteria
- AC1: osd-deployment passes again
- AC2: All w37-w39 run OSD production jobs
Suggestions
- DONE: Take machine out of production: https://progress.opensuse.org/projects/openqav3/wiki/#Take-machines-out-of-salt-controlled-production
- DONE: Remove machine XYZ from production: `ssh osd "sudo salt-key -y -d XYZ"`
- Retrigger failed osd deployment CI pipeline
- Confirm if this is one or multiple, possibly already known issues
- Fix any potential hardware issue, e.g. with hardware replacement
- Ensure machines are back in production
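The removal step above can be sketched as a loop over the four workers named in this ticket. `print_removal_cmds` is a helper name introduced here for illustration; it only prints the `salt-key` commands so they can be reviewed first and then, if correct, piped to `ssh osd sh`:

```shell
#!/bin/sh
# Sketch: remove worker36-39 from salt-controlled production.
# Only prints the commands; review them, then run on OSD (or pipe to `ssh osd sh`).
print_removal_cmds() {
  for i in 36 37 38 39; do
    echo "sudo salt-key -y -d worker$i.oqa.prg2.suse.org"
  done
}

print_removal_cmds
```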
Rollback steps
- https://progress.opensuse.org/projects/openqav3/wiki/#Bring-back-machines-into-salt-controlled-production
`for i in 36 37 38 39; do sudo salt-key -y -a worker$i.oqa.prg2.suse.org; done && sleep 30 && for i in 36 37 38 39; do sudo salt --state-output=changes "worker$i*" state.apply; done`
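After re-adding the keys and applying the state, per-worker connectivity can be checked with salt's standard `test.ping` module. A sketch in the same style as above (`print_check_cmds` is a hypothetical helper; it only prints the commands, which would be run on OSD):

```shell
#!/bin/sh
# Sketch: generate per-worker salt connectivity checks after the rollback.
# Prints the commands only; run them on OSD to verify minions respond.
print_check_cmds() {
  for i in 36 37 38 39; do
    echo "sudo salt worker$i.oqa.prg2.suse.org test.ping"
  done
}

print_check_cmds
```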
Updated by okurz 9 months ago
- Related to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added
Updated by okurz 9 months ago
- Related to coordination #157669: websockets+scheduler improvements to support more online worker instances added
Updated by okurz 5 months ago
As discussed today between nicksinger and me, we can keep worker3[6-9] offline for the time being, especially as there is less load over the summer and we can save electrical energy. When more workers are needed, they can be brought back into production at any time.
Updated by okurz 3 months ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by okurz 3 months ago
- Related to action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added