action #166802
open
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #139010: [epic] Long OSD ppc64le job queue
Recover worker37, worker38, worker39 size:S
Added by okurz 7 months ago.
Updated 6 months ago.
Category:
Feature requests
Description
Motivation
After #139103 we should ensure that all remaining, currently offline machines in the PRG2 oQA infra are up and running again.
Acceptance criteria
- AC1: All w37-w39 run OSD production jobs
- AC2: non-x86_64, non-qemu jobs are still executed and not starved out by too many x86_64 jobs
Suggestions
- Take care to apply the workarounds from #157975-12 to prevent accidental distribution upgrades (see the sketch after this list)
- Read what was done in #139103, bring up all w37-w39 again into production
- Monitor for the impact on qemu_ppc64le job age as well as other non-x86_64, non-qemu jobs
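The exact workaround from #157975-12 is not quoted here; as a hypothetical illustration only (package name assumed), accidental distribution upgrades can be guarded against by locking the release package with zypper on each worker:

  # hypothetical sketch: lock the release package so a stray "zypper dup" cannot switch the distribution
  sudo zypper addlock openSUSE-release
  # confirm the lock is active
  zypper locks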
- Copied from action #139103: Long OSD ppc64le job queue - Decrease number of x86_64 worker slots on osd to give ppc64le jobs a better chance to be assigned jobs size:M added
- Related to action #157726: osd-deployment | Failed pipeline for master (worker3[6-9].oqa.prg2.suse.org) added
- Subject changed from Recover worker37, worker38, worker39 to Recover worker37, worker38, worker39 size:S
- Description updated (diff)
- Status changed from Blocked to Workable
- Assignee deleted (okurz)
- Status changed from Workable to In Progress
- Assignee set to okurz
Applying the same approach as on #139103-29 to downgrade the firewall.
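The concrete packages and versions are in #139103-29; a minimal sketch, assuming the downgrade is done with zypper and that firewalld is the package in question:

  # hypothetical sketch: install an older firewalld build and lock it so it is not upgraded again
  sudo zypper install --oldpackage firewalld-<older-version>
  sudo zypper addlock firewalld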
- Due date set to 2024-10-03
Setting due date based on mean cycle time of SUSE QE Tools
- Related to action #167057: Run more standard, qemu OSD openQA jobs in CC-compliant PRG2 and none in NUE2 size:S added
- Related to action #134924: Websocket server overloaded, affected worker slots shown as "broken" with graceful disconnect in workers table added
- Due date deleted (2024-10-03)
- Status changed from In Progress to Blocked
Blocked by #167057, as we need to prevent #134924 and its parent ticket(s) from impacting us. After #167057 we can try to enable more worker instances from PRG2 again.
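For when the blocker is resolved, bringing additional worker instances back into production would roughly look like the following on a worker host; a sketch assuming the usual openQA systemd template units and an arbitrary example slot range:

  # hypothetical sketch: enable and start worker slots 1-10 on a recovered host
  sudo systemctl enable --now openqa-worker-auto-restart@{1..10}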
Due to reports in Slack (also see #167081) I have removed w37 from production as well now.
- Related to action #167081: test fails in support_server/setup on osd worker37 size:S added
I moved my previous comment to #167081, which is about worker37 failing in production, where according to #166802#note-11 it should not even be. We need to take care to keep tickets updated.
- Status changed from Blocked to In Progress
#167057 and #157690 are done, so we can continue. I powered on w39 and accepted the salt key. Now running sudo salt --state-output=changes -C 'G@roles:worker and G@osarch:x86_64' state.apply | grep -v Result
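Accepting the minion key on the salt master is a single command; a sketch, with the FQDN assumed from the worker3[6-9].oqa.prg2.suse.org naming seen in #157726:

  # list pending minion keys on the salt master, then accept the recovered worker
  sudo salt-key -l unaccepted
  sudo salt-key -a worker39.oqa.prg2.suse.org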
- Related to action #157690: Simple global limit of registered/online workers size:M added
- Status changed from In Progress to Blocked
I brought up w38 as well and did
for i in worker38 worker39; do openqa-clone-job --skip-chained-deps --repeat=60 --within-instance https://openqa.suse.de/tests/15639721 {TEST,BUILD}+=-poo166802-okurz _GROUP=0 WORKER_CLASS=$i; done
but I can already see on OSD in /var/log/openqa_scheduler: [2024-10-12T13:52:26.746206Z] [warn] [pid:24409] Failed to send data to websocket server, reason: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/WebSockets/Client.pm line 27.
so #157690 is not effective. I reopened it and disabled and powered down both w37+w38 again.
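To verify whether the websocket timeouts keep occurring during such an experiment, counting the warning shown above in the scheduler log is sufficient; a minimal sketch using the log path and message from this comment:

  # count "Inactivity timeout" websocket failures reported by the scheduler on OSD
  sudo grep -c 'Failed to send data to websocket server' /var/log/openqa_scheduler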
- Related to action #168178: Limit connected online workers based on websocket+scheduler load size:M added
- Target version changed from Ready to future