action #182306

closed

coordination #122665: [epic] Improved PowerVM testing

openQA worker instances on petrol reported as offline, no update in logs since multiple days

Added by okurz 20 days ago. Updated 18 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-05-13
Due date: 2025-05-29
% Done: 0%
Estimated time:
Description

Observation

This observation was triggered by https://suse.slack.com/archives/C02CANHLANP/p1747148336798609 thx to @acarvajal for asking:

@qa-tools what's the status of workers with qemu_ppc64le-large-mem,tap? I saw jobs scheduled in SLES for SAP 15-SP6 QR job group for a while, and see that there are currently no workers matching this, but only with qemu_ppc64le-large-mem,tap_secondary ... should we re-schedule tests with tap_secondary instead of tap?

petrol should have such slots, but two of them are affected by #182303; the other slots are reported as "offline" even though the openQA worker processes are running.

Actions #1

Updated by okurz 20 days ago

  • Priority changed from Normal to Urgent

Bumping to "Urgent" so that we have a chance to collect necessary data from the current live state before data vanishes.

Actions #3

Updated by livdywan 19 days ago

According to @okurz this should be a dev ticket - from the description I would have thought the tag is missing, so mentioning it here in case somebody else has the same thought

Actions #4

Updated by gpuliti 19 days ago

  • Assignee set to emiler

Actions #5

Updated by emiler 19 days ago

  • Status changed from New to In Progress

Actions #6

Updated by emiler 19 days ago

Looking at petrol.qe.nue2.suse.org. Last activity on worker services [1,4,5,6,7] was on May 8th, dormant since.
Workers [2,3] are unable to connect to openqa.suse.de:

May 14 07:14:25 petrol worker[4401]: [info] [pid:4401] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3408
May 14 07:14:27 petrol worker[4401]: [info] [pid:4401] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3408
May 14 07:14:59 petrol worker[4401]: [warn] [pid:4401] Websocket connection to http://openqa.suse.de/api/v1/ws/3408 finished by remote side with code 1006, no reason - trying again in 10 seconds

Collecting logs now.
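As a rough aid for going through the collected journal, the reconnect loop above can be counted per worker ID. This is only a sketch: the regular expression is modelled on the three log lines quoted above, and the function name `count_abnormal_closes` is made up for illustration. Close code 1006 is the WebSocket "abnormal closure" code, meaning the connection dropped without a close frame.

```python
import re
from collections import Counter

# Pattern modelled on the "[warn] ... finished by remote side with code N" line
# quoted in this ticket; other openQA versions may word it differently.
CLOSE_RE = re.compile(
    r"worker\[(?P<pid>\d+)\]: \[warn\] .*?/api/v1/ws/(?P<worker_id>\d+) "
    r"finished by remote side with code (?P<code>\d+)"
)


def count_abnormal_closes(journal_lines):
    """Count websocket closes per (worker ID, close code) pair."""
    counts = Counter()
    for line in journal_lines:
        match = CLOSE_RE.search(line)
        if match:
            counts[(int(match["worker_id"]), int(match["code"]))] += 1
    return counts


# The three lines from the comment above, used as sample input.
sample = [
    "May 14 07:14:25 petrol worker[4401]: [info] [pid:4401] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3408",
    "May 14 07:14:27 petrol worker[4401]: [info] [pid:4401] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3408",
    "May 14 07:14:59 petrol worker[4401]: [warn] [pid:4401] Websocket connection to http://openqa.suse.de/api/v1/ws/3408 finished by remote side with code 1006, no reason - trying again in 10 seconds",
]

for (worker_id, code), n in count_abnormal_closes(sample).items():
    print(f"worker {worker_id}: {n} close(s) with code {code}")
```

A high count for one worker ID over a short window would suggest a connect/drop loop rather than a one-off disconnect.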

Actions #7

Updated by emiler 19 days ago

Restarting openqa-worker-auto-restart@7.service seems to have re-established the worker: https://openqa.suse.de/admin/workers/3409 (Status: Idle).
Still not sure what the core issue is though.

Actions #8

Updated by emiler 19 days ago

Workers [1,4,5,6,7] have been brought back and are now registered in https://openqa.suse.de/admin/workers.
I guess we can lower the urgency now?

Actions #9

Updated by okurz 19 days ago

  • Priority changed from Urgent to High

Thank you for that. The urgency is about collecting enough information to be able to fix the issue in openQA. I lowered the priority to "High" since you took the first steps. As suggested in https://suse.slack.com/archives/C02AJ1E568M/p1747216433141269?thread_ts=1747216238.891779&cid=C02AJ1E568M:

And next steps can be to record the process table, system journal, logs from workers, logs from openqa.suse.de corresponding to the timeframe or referencing the worker, etc. Anything that looks relevant and could be lost in case we e.g. reboot the machine
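A minimal sketch of how such a snapshot could be scripted before a reboot, assuming every slot runs under a unit named like the `openqa-worker-auto-restart@7.service` mentioned in this ticket. The helper `snapshot_commands` is hypothetical and only builds the command lines; it does not run them, so the output can be reviewed first.

```python
import shlex


def snapshot_commands(slots, since):
    """Build (but do not run) commands that capture worker state.

    slots: worker instance numbers (assumed to map to systemd template units).
    since: a date string accepted by journalctl's --since option.
    """
    commands = [
        ["ps", "-ef"],  # process table
        ["journalctl", "--no-pager", "--since", since],  # system journal
    ]
    for slot in slots:
        # Assumption: all slots use the same unit template as slot 7.
        unit = f"openqa-worker-auto-restart@{slot}.service"
        commands.append(["journalctl", "--no-pager", "--since", since, "-u", unit])
    return commands


for command in snapshot_commands([1, 4, 5, 6, 7], "2025-05-08"):
    print(shlex.join(command))
```

Redirecting each printed command's output to a file kept off the affected machine would preserve the data across a reboot.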

Actions #10

Updated by emiler 19 days ago

I tried scheduling something for good measure (with WORKER_CLASS=petrol) and the job got picked up, so the workers are confirmed to be working at the moment (https://openqa.suse.de/admin/workers/3409).
Most of the logs are gathered, but I'll keep digging for more relevant info.

Actions #11

Updated by openqa_review 18 days ago

  • Due date set to 2025-05-29

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by gpathak 18 days ago · Edited

@nicksinger investigated why worker slots on petrol and diesel had not been connecting to OSD for the past week or so, and found that WireGuard was preventing the workers from registering with OSD.
The NUE2 WireGuard was down on 8 May 2025: https://suse.slack.com/archives/C02AJ1E568M/p1746703936377089

Ref: https://progress.opensuse.org/issues/181988

Actions #13

Updated by emiler 18 days ago

gpathak wrote in #note-12:

@nicksinger investigated why worker slots on petrol and diesel had not been connecting to OSD for the past week or so, and found that WireGuard was preventing the workers from registering with OSD.
The NUE2 WireGuard was down on 8 May 2025: https://suse.slack.com/archives/C02AJ1E568M/p1746703936377089

Ref: https://progress.opensuse.org/issues/181988

Thanks, that's a good insight and seems likely to be the cause. That would mean the issue is resolved, but I'll try to confirm this so that we don't overlook a different issue.

Actions #14

Updated by emiler 18 days ago

  • Status changed from In Progress to Resolved