action #122458
closed: O3 ipmi worker rebel:5 is broken size:M
Description
Observation
In the O3 web UI, rebel:5 initially showed 'offline' status. I ran systemctl restart openqa-worker@5
on rebel, and the worker status in the web UI then changed to 'broken', as below:
rebel:5 rebel 64bit-ipmi_rebel,64bit-ipmi-large-mem_rebel,64bit-ipmi-amd_rebel,blackbauhinia_rebel x86_64 **Broken** 1 34
Here is the failure:
422138 Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
422139 Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>
422140 Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked
422142 Dec 26 08:52:57 rebel worker[24670]: at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 757.
422143 Dec 26 08:52:57 rebel worker[24670]: OpenQA::Worker::_lock_pool_directory(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/o>
422144 Dec 26 08:52:57 rebel worker[24670]: eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 745
...
422161 Dec 26 08:52:57 rebel worker[24670]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/openqa/script/wor>
422162 Dec 26 08:52:57 rebel worker[24670]: - checking again for web UI 'http://openqa1-opensuse' in 100.00 s
Could you help fix the failure, or point me to how to fix it?
Acceptance criteria
- AC1: The openQA worker instance rebel:5 passes openQA jobs
Suggestions
- Log into the machine "rebel" part of the o3 infrastructure, check process table, check logs, check files in pool directory. Try a reboot of the machine, monitor openQA jobs on the instance. Look for crashes of isotovideo or the openQA worker and any left-over lock files.
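The log-checking step above can be sketched as a small filter. This is a hedged sketch, not official openQA tooling: the grep pattern and the journalctl invocation (unit name included) are assumptions based on this ticket, and the filter is a plain text filter so it can be applied to any log stream.

```shell
# Filter worker log lines for lock/crash indicators (pattern is an assumption
# derived from the messages quoted in this ticket).
filter_worker_errors() {
    grep -iE 'unable to lock|already locked|crash|died' || true
}

# On the worker host one might run something like (unit name assumed):
journalctl -u openqa-worker@5 --no-pager 2>/dev/null | filter_worker_errors | tail -n 5
```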
Updated by okurz almost 2 years ago
- Target version set to Ready
The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.
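A minimal sketch of that check, assuming (as the transcript below shows) that the worker writes its PID into <pool>/.locked; the function and path are illustrative, not part of openQA itself:

```shell
# Report whether a pool directory's lock file points at a live process.
check_pool_lock() {
    pool="$1"
    if [ ! -e "$pool/.locked" ]; then
        echo "not locked"
        return 0
    fi
    pid=$(cat "$pool/.locked")
    # kill -0 only tests for existence of the process, it sends no signal
    if kill -0 "$pid" 2>/dev/null; then
        echo "locked by running PID $pid"
    else
        echo "stale lock: PID $pid is gone"
    fi
}

check_pool_lock /var/lib/openqa/pool/5
```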
Updated by Julie_CAO almost 2 years ago
okurz wrote:
The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.
I did the following steps to remove the lock, but it failed:
rebel:~ # systemctl stop openqa-worker@5
rebel:~ # cat /var/lib/openqa/pool/5/.locked
25273
rebel:~ # ps -ww 25273
25273 ? Ss 0:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
rebel:~ # kill -9 25273
rebel:~ # rm /var/lib/openqa/pool/5/.locked
rebel:~ # cat /var/lib/openqa/pool/5/.locked
cat: /var/lib/openqa/pool/5/.locked: No such file or directory
rebel:~ # systemctl restart openqa-worker@5
rebel:~ # systemctl status openqa-worker@5   # still fails with the same errors (the pool is locked again)
Active: active (running) since Mon 2022-12-26 10:59:28 CET; 4s ago
Process: 28959 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/5 (code=exited, status=0/SUCCESS)
Main PID: 28960 (worker)
Tasks: 1 (limit: 4915)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@5.service
└─ 28960 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
...
rebel:~ # cat /var/lib/openqa/pool/5/.locked
27114
rebel:~ # ps -ef | grep 27114   # its parent process is 1, not the worker service I just restarted
_openqa+ 27114 1 0 10:55 ? 00:00:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
I have no idea which process locks the openQA pool directory. Could anyone help?
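When a process has been reparented to PID 1, the systemd unit that originally spawned it can still usually be read from its cgroup. A hedged sketch of that lookup (the cgroup path layout is an assumption; the PROC_ROOT parameter exists only so the function is illustrable outside a real host, and PID 27114 is the one from the transcript above):

```shell
# Print the systemd .service name found in a process's cgroup path, if any.
unit_of_pid() {
    tr '/' '\n' 2>/dev/null < "${PROC_ROOT:-/proc}/$1/cgroup" \
        | grep '\.service$' | tail -n 1
    return 0
}

unit_of_pid 27114
```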
Updated by xlai almost 2 years ago
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Normal to High
Updated by okurz almost 2 years ago
- Status changed from Feedback to New
- Assignee deleted (okurz)
- Priority changed from High to Normal
Updated by xlai almost 2 years ago
Julie_CAO wrote:
okurz wrote:
The message says that the pool dir is already locked. Maybe some abnormal exit of a previous process caused this. Someone with ssh access can login and check if there is currently a process blocking the directory.
I did the following steps to remove the lock, but it failed (full shell transcript in my previous comment). I have no idea which process locks the openQA pool directory. Could anyone help?
@okurz, this is related to the openQA worker and can't be easily solved by killing the locking process. This is O3, which needs careful handling, so we would need openQA worker experts to share a solution for handling it elegantly. Would you or your team please give Julie some help here? What can be done to clear the lock issue? This issue is blocking VT test deployment on O3, which we plan to finish before Jan 19 (quite some further analysis and fix work is needed on our side AFTER the worker is up). Is there any chance this issue could be handled sooner? Thanks very much.
Updated by Julie_CAO almost 2 years ago
I checked again and the issue has changed. The worker rebel:5 seems to work, but jobs fail to be assigned to it.
Jan 04 04:31:52 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:31:53 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029
Jan 04 04:32:02 rebel worker[28960]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 1008, only one connection per worker allowed, finishing old one in favor of new one - trying again in 10 seconds
Jan 04 04:32:12 rebel worker[28960]: [info] Registering with openQA http://openqa1-opensuse
Jan 04 04:32:12 rebel worker[28960]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
Jan 04 04:32:12 rebel worker[28960]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
Jan 04 04:32:14 rebel worker[28960]: [warn] Refusing to grab job from http://openqa1-opensuse: already busy with job(s) 3011029
The jobs were repeatedly restarted and ended up incomplete with Reason: abandoned: associated worker rebel:5 re-connected but abandoned the job.
https://openqa.opensuse.org/admin/workers/382
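Two worker processes registering with the same worker ID would produce exactly this repeating code-1008 "only one connection per worker" loop, so counting those warnings is one way to confirm that suspicion. A sketch (the journalctl invocation and unit name are assumptions; the counter itself is a plain text filter):

```shell
# Count websocket kicks with code 1008 in a log stream; a steadily growing
# count suggests two worker processes fighting over one worker ID.
count_1008() {
    grep -c 'finished by remote side with code 1008' || true
}

journalctl -u openqa-worker@5 --no-pager 2>/dev/null | count_1008
```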
It seems this issue needs to be handled on the openQA database side; I can't do more.
Updated by okurz almost 2 years ago
- Project changed from openQA Infrastructure to openQA Project
- Subject changed from O3 ipmi worker rebel:5 is broken to O3 ipmi worker rebel:5 is broken size:M
- Description updated (diff)
- Category set to Regressions/Crashes
- Status changed from New to Workable
- Priority changed from Normal to High
Updated by Julie_CAO over 1 year ago
Weird, the worker began to work.
https://openqa.opensuse.org/admin/workers/382
Updated by okurz over 1 year ago
- Related to action #123028: A/C broken in TAM lab size:M added
Updated by okurz over 1 year ago
Resolution of the ticket will be blocked by #123028 but one can still investigate what's going on with the locking.
Updated by mkittler over 1 year ago
- Status changed from Workable to Resolved
I systemctl restart openqa-worker@5 on rebel
You shouldn't have because that creates a 2nd instance of the worker (in addition to the instance started via openqa-worker-auto-restart@5) which will lead to problems locking the pool directory. So the problem you're getting is expected:
422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked
At this point only the auto-restart units are active so I suppose everything is fine again.
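One way to guard against starting a second worker per slot is to count how many worker units already serve that slot before (re)starting anything. A hedged sketch (unit naming is assumed from this ticket; the counting is factored into a plain filter over systemctl output):

```shell
# Count systemd units serving worker slot N in `systemctl list-units` output.
# More than one active unit per slot means two workers will fight over the pool.
count_slot_units() {
    grep -c "openqa-worker[^ ]*@$1\.service" || true
}

systemctl list-units --all --no-legend 'openqa-worker*' 2>/dev/null | count_slot_units 5
```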
All slots except rebel:6 show up as idle but rebel:6 is only offline because the amd-zen2-gpu-sut1 is currently offline due to #123028.
Not sure why rebel:5 was initially offline because logs don't go so far.
Note that the documentation stating that openqa-worker@ and openqa-worker-auto-restart@ can be used interchangeably doesn't apply to o3 workers (they don't actually use this salt repository, so the symlinking might not be done that way; on o3 workers you really have to see how the individual hosts are set up before doing anything).
With that explanation I'd like to resolve the issue. Feel free to reopen if you have any further questions.