Project

General

Profile

Actions

action #122458

closed

O3 ipmi worker rebel:5 is broken size:M

Added by Julie_CAO almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-12-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

In O3 web UI, initially rebel:5 shows 'offline' status, I systemctl restart openqa-worker@5 on rebel, then the worker status in web UI changed to 'broken' as below.

rebel:5    rebel    64bit-ipmi_rebel,64bit-ipmi-large-mem_rebel,64bit-ipmi-amd_rebel,blackbauhinia_rebel    x86_64    **Broken**     1    34

Here is the failure:

 422138 Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
 422139 Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>
 422140 Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
 422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked
 422142 Dec 26 08:52:57 rebel worker[24670]:  at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 757.
 422143 Dec 26 08:52:57 rebel worker[24670]:         OpenQA::Worker::_lock_pool_directory(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/o>
 422144 Dec 26 08:52:57 rebel worker[24670]:         eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 745
...
 422161 Dec 26 08:52:57 rebel worker[24670]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/openqa/script/wor>
 422162 Dec 26 08:52:57 rebel worker[24670]:  - checking again for web UI 'http://openqa1-opensuse' in 100.00 s

Could you help to fix the failure, or could you point me how to fix it?

Acceptance criteria

  • AC1: The openQA worker instance rebel:5 passes openQA jobs

Suggestions

  • Log into the machine "rebel" part of the o3 infrastructure, check process table, check logs, check files in pool directory. Try a reboot of the machine, monitor openQA jobs on the instance. Look for crashes of isotovideo or the openQA worker and any left-over lock files.

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #123028: A/C broken in TAM lab size:MResolvednicksinger2023-01-12

Actions
Actions #1

Updated by okurz almost 2 years ago

  • Target version set to Ready

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

Actions #2

Updated by Julie_CAO almost 2 years ago

okurz wrote:

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

I did following steps to remove the lock but failed:

rebel:~ # systemctl stop openqa-worker@5

rebel:~ # cat /var/lib/openqa/pool/5/.locked
25273

rebel:~ # ps -ww 25273
25273 ?        Ss     0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

rebel:~ # kill -9 25273

rebel:~ # rm /var/lib/openqa/pool/5/.locked

rebel:~ # cat /var/lib/openqa/pool/5/.locked
cat: /var/lib/openqa/pool/5/.locked: No such file or directory

rebel:~ # systemctl restart openqa-worker@5

rebel:~ # systemctl status openqa-worker@5   //still failed with same errors(the pool is locked again)
     Active: active (running) since Mon 2022-12-26 10:59:28 CET; 4s ago
    Process: 28959 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/5 (code=exited, status=0/SUCCESS)
   Main PID: 28960 (worker)
      Tasks: 1 (limit: 4915)
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@5.service
             └─ 28960 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
...

rebel:~ # cat /var/lib/openqa/pool/5/.locked
27114

rebel:~ # ps -ef | grep 27114   //its parent process is '1', rather than the worker service I just restarted.
_openqa+ 27114     1  0 10:55 ?        00:00:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

I have no idea which process lock the openqa pool dir. Could anyone help?

Actions #3

Updated by xlai almost 2 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Priority changed from Normal to High
Actions #4

Updated by okurz almost 2 years ago

  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Priority changed from High to Normal
Actions #5

Updated by xlai almost 2 years ago

Julie_CAO wrote:

okurz wrote:

The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.

I did following steps to remove the lock but failed:

rebel:~ # systemctl stop openqa-worker@5

rebel:~ # cat /var/lib/openqa/pool/5/.locked
25273

rebel:~ # ps -ww 25273
25273 ?        Ss     0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

rebel:~ # kill -9 25273

rebel:~ # rm /var/lib/openqa/pool/5/.locked

rebel:~ # cat /var/lib/openqa/pool/5/.locked
cat: /var/lib/openqa/pool/5/.locked: No such file or directory

rebel:~ # systemctl restart openqa-worker@5

rebel:~ # systemctl status openqa-worker@5   //still failed with same errors(the pool is locked again)
     Active: active (running) since Mon 2022-12-26 10:59:28 CET; 4s ago
    Process: 28959 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/5 (code=exited, status=0/SUCCESS)
   Main PID: 28960 (worker)
      Tasks: 1 (limit: 4915)
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@5.service
             └─ 28960 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
...

rebel:~ # cat /var/lib/openqa/pool/5/.locked
27114

rebel:~ # ps -ef | grep 27114   //its parent process is '1', rather than the worker service I just restarted.
_openqa+ 27114     1  0 10:55 ?        00:00:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5

I have no idea which process lock the openqa pool dir. Could anyone help?

@okurz, this is related with openqa worker and can't be easily solved by kill the lock process. It is O3 which needs to be careful on further handling, and would need openqa worker experts to share some solution to handle it elegantly. Would you or your team please give julie some help here? What can be done to clear the lock issue? This issue is blocking VT test deployment on O3 which we plan to finish before Jan 19(quite some further analysis&fixes work needed on our side AFTER worker is up). Is there any chance to let this issue be handled sooner? Thanks very much.

Actions #7

Updated by Julie_CAO almost 2 years ago

I checked again and the issue changed to be different. The worker rebel:5 seems to work, but the jobs failed to be assigned to it.

Jan 04 04:31:52 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:31:53 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029
Jan 04 04:32:02 rebel worker[28960]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 1008, only one connection per worker allowed, finishing old one in favor of new one - trying again in 10 seconds
Jan 04 04:32:12 rebel worker[28960]: [info] Registering with openQA http://openqa1-opensuse
Jan 04 04:32:12 rebel worker[28960]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
Jan 04 04:32:12 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:32:14 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029

The jobs were restarted and imcomplete with Reason: abandoned: associated worker rebel:5 re-connected but abandoned the job repeatedly.
https://openqa.opensuse.org/admin/workers/382

It seems this issue need to be handled from openqa database side, I can't do more.

Actions #8

Updated by okurz almost 2 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Subject changed from O3 ipmi worker rebel:5 is broken to O3 ipmi worker rebel:5 is broken size:M
  • Description updated (diff)
  • Category set to Regressions/Crashes
  • Status changed from New to Workable
  • Priority changed from Normal to High
Actions #9

Updated by okurz almost 2 years ago

  • Tags set to reactive work
Actions #10

Updated by Julie_CAO over 1 year ago

Weird, the worker began to work.
https://openqa.opensuse.org/admin/workers/382

Actions #11

Updated by okurz over 1 year ago

Actions #12

Updated by okurz over 1 year ago

Resolution of the ticket will be blocked by #123028 but one can still investigate what's going on with the locking.

Actions #13

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #14

Updated by mkittler over 1 year ago

  • Status changed from Workable to Resolved

I systemctl restart openqa-worker@5 on rebel

You shouldn't have because that creates a 2nd instance of the worker (in addition to the instance started via openqa-worker-auto-restart@5) which will lead to problems locking the pool directory. So the problem you're getting is expected:

422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked

At this point only the auto-restart units are active so I suppose everything is fine again.

All slots except rebel:6 show up as idle but rebel:6 is only offline because the amd-zen2-gpu-sut1 is currently offline due to #123028.

Not sure why rebel:5 was initially offline because logs don't go so far.

Note that the documentation which states that openqa-worker@ and openqa-worker-auto-restart@ can be used interchangeably doesn't apply to o3 workers (as they don't actually use this salt repository and thus the symlinking might not be done that way; on o3 workers you really have to see how the individual hosts are setup before doing anything).

With that explanation I'd like to resolve the issue. Feel free to reopen if you have any further questions.

Actions #15

Updated by Julie_CAO over 1 year ago

mkittler wrote:

Thank you, @mkittler. I introduced another problem when I tried to fix a problem.

Actions

Also available in: Atom PDF