action #122458: O3 ipmi worker rebel:5 is broken size:M - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #122458

closed

O3 ipmi worker rebel:5 is broken size:M

Added by Julie_CAO over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2022-12-26

Due date:

% Done:

Estimated time:

Tags:

reactive work

Description

Observation¶

In O3 web UI, initially rebel:5 shows 'offline' status, I systemctl restart openqa-worker@5 on rebel, then the worker status in web UI changed to 'broken' as below.

rebel:5    rebel    64bit-ipmi_rebel,64bit-ipmi-large-mem_rebel,64bit-ipmi-amd_rebel,blackbauhinia_rebel    x86_64    **Broken**     1    34

Here is the failure:

 422138 Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
 422139 Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>
 422140 Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
 422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked
 422142 Dec 26 08:52:57 rebel worker[24670]:  at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 757.
 422143 Dec 26 08:52:57 rebel worker[24670]:         OpenQA::Worker::_lock_pool_directory(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/o>
 422144 Dec 26 08:52:57 rebel worker[24670]:         eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 745
...
 422161 Dec 26 08:52:57 rebel worker[24670]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/openqa/script/wor>
 422162 Dec 26 08:52:57 rebel worker[24670]:  - checking again for web UI 'http://openqa1-opensuse' in 100.00 s

Could you help to fix the failure, or could you point me how to fix it?

Acceptance criteria¶

AC1: The openQA worker instance rebel:5 passes openQA jobs

Suggestions¶

Log into the machine "rebel" part of the o3 infrastructure, check process table, check logs, check files in pool directory. Try a reboot of the machine, monitor openQA jobs on the instance. Look for crashes of isotovideo or the openQA worker and any left-over lock files.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by okurz over 2 years ago

Target version set to Ready

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

Actions

Copy link

Updated by Julie_CAO over 2 years ago

okurz wrote:

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

I did following steps to remove the lock but failed:

rebel:~ # systemctl stop openqa-worker@5

rebel:~ # cat /var/lib/openqa/pool/5/.locked
25273

rebel:~ # ps -ww 25273
25273 ?        Ss     0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

rebel:~ # kill -9 25273

rebel:~ # rm /var/lib/openqa/pool/5/.locked

rebel:~ # cat /var/lib/openqa/pool/5/.locked
cat: /var/lib/openqa/pool/5/.locked: No such file or directory

rebel:~ # systemctl restart openqa-worker@5

rebel:~ # systemctl status openqa-worker@5   //still failed with same errors(the pool is locked again)
     Active: active (running) since Mon 2022-12-26 10:59:28 CET; 4s ago
    Process: 28959 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/5 (code=exited, status=0/SUCCESS)
   Main PID: 28960 (worker)
      Tasks: 1 (limit: 4915)
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@5.service
             └─ 28960 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
...

rebel:~ # cat /var/lib/openqa/pool/5/.locked
27114

rebel:~ # ps -ef | grep 27114   //its parent process is '1', rather than the worker service I just restarted.
_openqa+ 27114     1  0 10:55 ?        00:00:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

I have no idea which process lock the openqa pool dir. Could anyone help?

Actions

Copy link

Updated by xlai over 2 years ago

Status changed from New to Feedback
Assignee set to okurz
Priority changed from Normal to High

Actions

Copy link

Updated by okurz over 2 years ago

Status changed from Feedback to New
Assignee deleted (~~okurz~~)
Priority changed from High to Normal

Actions

Copy link

Updated by xlai over 2 years ago

Julie_CAO wrote:

okurz wrote:

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

I did following steps to remove the lock but failed:

rebel:~ # systemctl stop openqa-worker@5

rebel:~ # cat /var/lib/openqa/pool/5/.locked
25273

rebel:~ # ps -ww 25273
25273 ?        Ss     0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

rebel:~ # kill -9 25273

rebel:~ # rm /var/lib/openqa/pool/5/.locked

rebel:~ # cat /var/lib/openqa/pool/5/.locked
cat: /var/lib/openqa/pool/5/.locked: No such file or directory

rebel:~ # systemctl restart openqa-worker@5

rebel:~ # systemctl status openqa-worker@5   //still failed with same errors(the pool is locked again)
     Active: active (running) since Mon 2022-12-26 10:59:28 CET; 4s ago
    Process: 28959 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/5 (code=exited, status=0/SUCCESS)
   Main PID: 28960 (worker)
      Tasks: 1 (limit: 4915)
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@5.service
             └─ 28960 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
...

rebel:~ # cat /var/lib/openqa/pool/5/.locked
27114

rebel:~ # ps -ef | grep 27114   //its parent process is '1', rather than the worker service I just restarted.
_openqa+ 27114     1  0 10:55 ?        00:00:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

I have no idea which process lock the openqa pool dir. Could anyone help?

@okurz, this is related with openqa worker and can't be easily solved by kill the lock process. It is O3 which needs to be careful on further handling, and would need openqa worker experts to share some solution to handle it elegantly. Would you or your team please give julie some help here? What can be done to clear the lock issue? This issue is blocking VT test deployment on O3 which we plan to finish before Jan 19(quite some further analysis&fixes work needed on our side AFTER worker is up). Is there any chance to let this issue be handled sooner? Thanks very much.

Actions

Copy link

Updated by Julie_CAO over 2 years ago

I checked again and the issue changed to be different. The worker rebel:5 seems to work, but the jobs failed to be assigned to it.

Jan 04 04:31:52 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:31:53 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029
Jan 04 04:32:02 rebel worker[28960]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 1008, only one connection per worker allowed, finishing old one in favor of new one - trying again in 10 seconds
Jan 04 04:32:12 rebel worker[28960]: [info] Registering with openQA http://openqa1-opensuse
Jan 04 04:32:12 rebel worker[28960]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
Jan 04 04:32:12 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:32:14 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029

The jobs were restarted and imcomplete with Reason: abandoned: associated worker rebel:5 re-connected but abandoned the job repeatedly.
https://openqa.opensuse.org/admin/workers/382

It seems this issue need to be handled from openqa database side, I can't do more.

Actions

Copy link

Updated by okurz over 2 years ago

Project changed from openQA Infrastructure (public) to openQA Project (public)
Subject changed from O3 ipmi worker rebel:5 is broken to O3 ipmi worker rebel:5 is broken size:M
Description updated (diff)
Category set to Regressions/Crashes
Status changed from New to Workable
Priority changed from Normal to High

Actions

Copy link

Updated by okurz over 2 years ago

Tags set to reactive work

Actions

Copy link

#10

Updated by Julie_CAO over 2 years ago

Weird, the worker began to work.
https://openqa.opensuse.org/admin/workers/382

Actions

Copy link

#11

Updated by okurz over 2 years ago

Related to action #123028: A/C broken in TAM lab size:M added

Actions

Copy link

#12

Updated by okurz over 2 years ago

Resolution of the ticket will be blocked by #123028 but one can still investigate what's going on with the locking.

Actions

Copy link

#13

Updated by mkittler over 2 years ago

Assignee set to mkittler

Actions

Copy link

#14

Updated by mkittler over 2 years ago

Status changed from Workable to Resolved

I systemctl restart openqa-worker@5 on rebel

You shouldn't have because that creates a 2nd instance of the worker (in addition to the instance started via openqa-worker-auto-restart@5) which will lead to problems locking the pool directory. So the problem you're getting is expected:

422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked

At this point only the auto-restart units are active so I suppose everything is fine again.

All slots except rebel:6 show up as idle but rebel:6 is only offline because the amd-zen2-gpu-sut1 is currently offline due to #123028.

Not sure why rebel:5 was initially offline because logs don't go so far.

Note that the documentation which states that openqa-worker@ and openqa-worker-auto-restart@ can be used interchangeably doesn't apply to o3 workers (as they don't actually use this salt repository and thus the symlinking might not be done that way; on o3 workers you really have to see how the individual hosts are setup before doing anything).

With that explanation I'd like to resolve the issue. Feel free to reopen if you have any further questions.

Actions

Copy link

#15

Updated by Julie_CAO over 2 years ago

mkittler wrote:

Thank you, @mkittler. I introduced another problem when I tried to fix a problem.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #122458

O3 ipmi worker rebel:5 is broken size:M

Observation¶

Acceptance criteria¶

Suggestions¶

Updated by okurz over 2 years ago

Updated by Julie_CAO over 2 years ago

Updated by xlai over 2 years ago

Updated by okurz over 2 years ago

Updated by xlai over 2 years ago

Updated by Julie_CAO over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by Julie_CAO over 2 years ago

Updated by okurz over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by Julie_CAO over 2 years ago