action #180989
[openQA][worker][ipmi] ipmi jobs are stuck in tear-down and run into MAX_JOB_TIME (was: It takes an extremely long time to finish the job when connected to an ipmi worker) size:S
Status: closed
Description
Observation
Recently I have encountered the same issue many times, namely that it takes an extremely long time to release a certain worker, for example grenache-1:14.
Initially I scheduled two test runs on the same worker to verify another issue, but I had to wait quite a long time after the first job finished before the second job started running.
Worker grenache-1:14 took more than 6.5 hours to finish its work on the first job, whose test modules actually took less than 4 hours. This means it took another 2.5 hours for worker grenache-1:14 to be released before it could take a new job. See this example job.
I know a worker may need to do some post-test-run work, upload logs and clean itself up before taking a new job, but 2.5 hours is still far too long for this case. It seems that something is wrong with it. I personally have not noticed the same issue with other workers so far.
Steps to reproduce
- Schedule two jobs on grenache-1:14
- Wait for the second job to be picked up after the first one finishes.
Impact
It might not have a big impact on large-scale test runs overall, but it does affect my verification progress.
Problem
It looks like the worker lingers on some unnoticed work after finishing its work on a job.
Suggestions
- Check the worker process/log on the worker machine (ps auxf on grenache-1 or sudo systemctl list-units 'openqa-worker*'); see the example commands after this list
- Do maintenance or cleanup if necessary
- Find the relevant code in os-autoinst, e.g. grep for the error messages
- DONE See if the problem reproduces when setting MAX_JOB_TIME=1
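A rough sketch of how these checks could look on grenache-1 (the instance number 14 and the os-autoinst install path are assumptions based on this ticket):
# overview of worker units and the state of the affected instance
sudo systemctl list-units 'openqa-worker*'
sudo systemctl status openqa-worker-auto-restart@14.service
# look for lingering isotovideo/encoder child processes below the worker
ps auxf | grep -A 5 -i 'isotovideo\|openqa-worker'
# recent worker log
sudo journalctl -u openqa-worker-auto-restart@14.service --since "2 hours ago"
# find the os-autoinst code that produces the orphan-collection message (path assumed)
grep -rn "collected unknown process" /usr/lib/os-autoinst/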
Workaround
n/a
Updated by livdywan about 1 month ago
- Tags set to infra, reactive work
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready
Probably good to check this soon, as this would slow down all jobs by the sounds of it.
Updated by robert.richardson 26 days ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 25 days ago
- Due date set to 2025-05-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson 22 days ago · Edited
As the job mentioned in the description was already deleted, I had a look at another affected job's autoinst.log; there seems to be an orphan process causing the issue. The rerun was also affected (cloned using openqa-clone-job --within-instance "https://openqa.suse.de/tests/17440104" --skip-chained-deps WORKER_CLASS=amd-zen3-gpu-sut1 _GROUP=0 {TEST,BUILD}+=-poo180689):
[2025-04-24T14:44:17.599Z] [debug] [pid:355418] IPMI: Chassis Power is on
[2025-04-24T14:44:17.602924Z] [debug] [pid:355398] backend shutdown state:
[2025-04-24T14:44:18.104431Z] [debug] [pid:355418] IPMI:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":36913"
after 28805 requests (28648 known processed) with 0 events remaining.
[2025-04-24T14:44:18.106254Z] [debug] [pid:355418] Passing remaining frames to the video encoder
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":36913"
[2025-04-24T14:44:18.108519Z] [info] [pid:355398] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 357109 and exit status: 1
[2025-04-24T14:44:22.154614Z] [info] [pid:355398] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 357115 and exit status: 0
[2025-04-24T18:24:14.586636Z] [debug] [pid:355398] isotovideo received signal TERM
[2025-04-24T18:24:14.586636Z] [debug] [pid:355418] backend got TERM
Any suggestions? Maybe I can modify _collect_orphan to output the cmd associated with the stuck process?
Also, this is the worker's systemd output (sudo journalctl -u openqa-worker-auto-restart@14.service --since "2025-04-24 10:23"):
Apr 24 14:44:26 grenache-1 worker[415547]: [debug] [pid:415547] Uploading artefact validate_system_health-12.txt
Apr 24 14:44:26 grenache-1 worker[442143]: [debug] [pid:442143] Upload concluded (up to validate_system_health)
Apr 24 14:44:36 grenache-1 worker[442143]: [debug] [pid:442143] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/17440104/status"
Apr 24 14:44:36 grenache-1 worker[442143]: [debug] [pid:442143] Upload concluded (no current module)
...
Apr 24 18:24:11 grenache-1 worker[442143]: [debug] [pid:442143] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/17440104/status"
Apr 24 18:24:11 grenache-1 worker[442143]: [debug] [pid:442143] Upload concluded (no current module)
Apr 24 18:24:14 grenache-1 worker[442143]: [debug] [pid:442143] Stopping job 17440104 from openqa.suse.de: 17440104-sle-15-SP7-Online-x86_64-Build87.1-gi-guest_developing-on-host_developing->
Apr 24 18:24:14 grenache-1 worker[442143]: [debug] [pid:442143] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/17440104/status"
Apr 24 18:24:14 grenache-1 worker[442143]: [debug] [pid:442143] Announcing job termination (due to timeout) to command server via http://localhost:20143/5aNWfMLXb5Xg5VlL/broadcast
Apr 24 18:24:14 grenache-1 worker[442143]: [debug] [pid:442143] Unable to announce job termination (NOT the reason for the job termination):
Apr 24 18:24:14 grenache-1 worker[442143]: [debug] [pid:442143] Command server is likely finished already
Apr 24 18:24:15 grenache-1 worker[442143]: [debug] [pid:442143] Registered process:355478
Apr 24 18:24:19 grenache-1 worker[442143]: [info] [pid:442143] +++ worker notes +++
Apr 24 18:24:19 grenache-1 worker[442143]: [info] [pid:442143] End time: 2025-04-24 18:24:19
Apr 24 18:24:19 grenache-1 worker[442143]: [info] [pid:442143] Result: timeout
Updated by livdywan 21 days ago
robert.richardson wrote in #note-6:
As the job mentioned in the description was already deleted, I had a look at another affected job's autoinst.log; there seems to be an orphan process causing the issue. The rerun was also affected (cloned using openqa-clone-job --within-instance "https://openqa.suse.de/tests/17440104" --skip-chained-deps WORKER_CLASS=amd-zen3-gpu-sut1 _GROUP=0 {TEST,BUILD}+=-poo180689). [log excerpt as in #note-6]
Any suggestions? Maybe I can modify _collect_orphan to output the cmd associated with the stuck process?
I think extending the output would be a good "minimal improvement" here. We don't know for sure how relevant it is, but there are several of these backend::driver::_collect_orphan: Driver backend collected unknown process with pid 762893 and exit status: 84
messages. It would be good to know what command it is, and whether it is the same one every time. Maybe we can even get the output?
Updated by okurz 21 days ago
- Priority changed from High to Normal
- You asked about this topic already in Slack. Can you please feed back the suggestions provided there?
- Please check other machines besides grenache-1:14 if any are affected the same
- Apparently the problem is not that jobs are not picked up soon, but that each job on grenache-1:14, which is controlling amd-zen3-gpu-sut1, runs into a ridiculous 8h timeout selected by the test maintainers, which obviously leads to very slow execution
- There are more recent jobs on https://openqa.suse.de/admin/workers/4000 which are not running into the timeout
- I suggest you create an additional improvement ticket for os-autoinst in general to prevent the 4h of no output, with a reproducer that is quicker than 8h
Updated by robert.richardson 21 days ago
- Description updated (diff)
- Status changed from In Progress to Workable
okurz wrote in #note-8:
- You asked about this topic already in Slack. Can you please feed back the suggestions provided there?
I've updated the ticket description to include your suggestions. Here's a link for reference.
The orphan process issue does not show up with MAX_JOB_TIME set to 1.
okurz wrote in #note-8:
- Please check other machines besides grenache-1:14 if any are affected the same
The _collect_orphan calls as well as the fatal IO errors show up on grenache in general.
For example:
grenache-1:15
- sle-15-SP7-Online-aarch64-Build87.1-kdump@ipmi-64bit-squiddlydiddly (passed)
- sle-15-SP7-Online-aarch64-Build87.1-gi-guest_developing-on-host_sles12sp5-kvm@virt-arm-64bit-ipmi (failed)
Other workers I checked, such as Diesel, Petrol or Mania, do not seem to have the issue.
okurz wrote in #note-8:
- Apparently the problem is not that jobs are not picked up soon, but that each job on grenache-1:14, which is controlling amd-zen3-gpu-sut1, runs into a ridiculous 8h timeout selected by the test maintainers, which obviously leads to very slow execution.
But almost all tests on grenache show the problematic output; it is just that for some tests whatever happens after [debug] ... Passing remaining frames to the video encoder
takes considerably longer than for others and ends up running into the MAX_JOB_TIME timeout, or am I missing something here?
okurz wrote in #note-8:
- There are more recent jobs on https://openqa.suse.de/admin/workers/4000 which are not running into the timeout
Yes, many do not run into the timeout, though as mentioned above they will still include
[2025-04-28T01:42:38.431480Z] [debug] [pid:324697] Passing remaining frames to the video encoder
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":60297"
[2025-04-28T01:42:38.432395Z] [info] [pid:324683] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 325009 and exit status: 1
[2025-04-28T01:42:39.438646Z] [info] [pid:324683] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 325012 and exit status: 0
frame= 4272 fps=0.8 q=0.0 Lsize= 5095kB time=00:02:57.95 bitrate= 234.5kbits/s speed=0.035x
video:5067kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.552782%
[2025-04-28T01:50:37.370595Z] [debug] [pid:324697] Waiting for video encoder to finalize the video
or similar, with varying durations.
okurz wrote in #note-8:
- I suggest you create an additional improvement ticket for os-autoinst in general to prevent the 4h of no output, with a reproducer that is quicker than 8h
I might need some help with creating the reproducer. I wanted to just schedule a simple boot-to-textmode test with an appropriate MAX_JOB_TIME value, though cloning already didn't work.
What would be an appropriate (short) test to clone for this, and do I need to specify the bare-metal machine as well, besides the worker class, or any other specific settings?
Updated by livdywan 21 days ago
okurz wrote in #note-8:
- I suggest you create an additional improvement ticket for os-autoinst in general to prevent the 4h of no output, with a reproducer that is quicker than 8h
I just had a look at Mojo::IOLoop::ReadWriteProcess::Session#collected_orphan. Not sure args can be read out here, but maybe /proc/$pid/cmdline still gives us something useful?
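For reference, a rough sketch of how that information could be pulled from procfs on the worker host while the process still exists (pid 357109 is just the example from the log above; note that for a process that has already exited or been reaped, cmdline will be empty or gone, so the command may need to be captured when the process is spawned instead):
# print the NUL-separated command line of a still-running process
tr '\0' ' ' < /proc/357109/cmdline; echo
# fallback: just the executable name
cat /proc/357109/comm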
I might need some help with creating the reproducer. I wanted to just schedule a simple boot-to-textmode test with an appropriate MAX_JOB_TIME value, though cloning already didn't work.
What would be an appropriate (short) test to clone for this, and do I need to specify the bare-metal machine as well, besides the worker class, or any other specific settings?
Maybe @waynechen55 has a suggestion here? I'd suggest asking first thing in your morning in Slack, and you will probably get a quick response.
Updated by livdywan 19 days ago
Maybe @waynechen55 has a suggestion here? I'd suggest asking first thing in your morning in Slack, and you will probably get a quick response.
Wayne is on hols, but I asked anyway; maybe someone else has a suggestion 🙃
Updated by robert.richardson 19 days ago
- Status changed from Workable to Feedback
Updated by robert.richardson 19 days ago · Edited
We created some jobs with differing MAX_JOB_TIME values, using
for i in 30 90 1200 3600; do openqa-clone-job --within-instance "https://openqa.suse.de/tests/17440104" --skip-chained-deps WORKER_CLASS=amd-zen3-gpu-sut1 _GROUP=0 MAX_JOB_TIME=$i BUILD+=-poo180689 TEST+=-poo180689-t$i; done
see:
https://openqa.suse.de/tests/overview?distri=sle&build=87.1-poo180689&version=15-SP7
as well as a follow-up ticket regarding the "... backend shutdown state:" line, which should return a value.
Updated by okurz 19 days ago
- Subject changed from [openQA][worker][ipmi] It takes an extremely long time to finish the job when connected to an ipmi worker size:S to [openQA][worker][ipmi] ipmi jobs are stuck in tear-down and run into MAX_JOB_TIME (was: It takes an extremely long time to finish the job when connected to an ipmi worker) size:S
Updated by okurz 15 days ago
https://openqa.suse.de/tests/17540050#settings shows "MAX_JOB_TIME" as "$", so your script snippet does not pass the variable.
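A quick way to verify the expansion before scheduling anything is to dry-run the loop with echo in front of the clone command (same loop as above, just printed instead of executed):
# dry run: print the commands instead of running them, to confirm $i is expanded
for i in 30 90 1200 3600; do echo openqa-clone-job --within-instance "https://openqa.suse.de/tests/17440104" --skip-chained-deps WORKER_CLASS=amd-zen3-gpu-sut1 _GROUP=0 MAX_JOB_TIME=$i BUILD+=-poo180689 TEST+=-poo180689-t$i; done
If MAX_JOB_TIME does not show up as 30, 90, 1200 and 3600 in the printed commands, the variable is not being expanded in the shell that runs the loop.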
Updated by robert.richardson 14 days ago
Fixed the command and restarted the mentioned experiment jobs (gi-guest_developing-on-host_developing-kvm-poo180689-t-novideo-3-*); these also pass NOVIDEO=1 to check whether the video creation is causing the issue. Now waiting for results.
Updated by robert.richardson 13 days ago · Edited
So at least the variable was passed this time (NOVIDEO=1).
However, it still tries to pass frames to the video encoder. Maybe I misunderstood something here, but I would have expected this line not to show up (see os-autoinst/doc/backend_vars); is this a separate issue or am I missing something?
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":36913"
after 28805 requests (28648 known processed) with 0 events remaining.
[2025-04-24T14:44:18.106254Z] [debug] [pid:355418] Passing remaining frames to the video encoder
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":36913"
[2025-04-24T14:44:18.108519Z] [info] [pid:355398] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 357109 and exit status: 1
[2025-04-24T14:44:22.154614Z] [info] [pid:355398] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 357115 and exit status: 0
[2025-04-24T18:24:14.586636Z] [debug] [pid:355398] isotovideo received signal TERM
[2025-04-24T18:24:14.586636Z] [debug] [pid:355418] backend got TERM
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":36705"
after 28828 requests (28827 known processed) with 0 events remaining.
[2025-04-24T18:24:14.591307Z] [debug] [pid:355418] sending magic and exit
Updated by okurz 13 days ago
- Related to action #181712: IPMI tests return empty backend shutdown state added
Updated by okurz 13 days ago
I guess we still need the video encoder to process the screenshots which we record regardless. Anyway, I guess this means that the actual video encoding does not change the results.
Another thing: I found that the jobs have TIMEOUT_SCALE=3, meaning that MAX_JOB_TIME is also scaled by that multiplier, which explains why they do not stop immediately after the expected time. Please focus on getting the best reproducer, report a separate openQA improvement ticket outside "infra", and then help the OP to understand that jobs take that long to be picked up because of a very long MAX_JOB_TIME*TIMEOUT_SCALE combination.
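For illustration (assuming the 8h MAX_JOB_TIME mentioned earlier in this ticket): 28800 s * 3 = 86400 s, so the worker would only abort such a job with result "timeout" after 24 hours.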
Updated by robert.richardson 12 days ago
- Copied to action #182117: [openQA][worker][ipmi][sporadic] grenache ipmi tests stuck "Waiting for 1 jobs to finish" leading to timeouts added
Updated by robert.richardson 12 days ago
- Related to action #182123: [openQA][worker][ipmi] Common fatal io errors "Resource temporarily unavailable" and orphaned processes on grenache-1 IPMI tests added
Updated by robert.richardson 12 days ago
- Status changed from In Progress to Feedback
As mentioned in the daily, after seeing this test's autoinst.log I think that we might have been on the wrong track, as it actually fits the description of the ticket a lot better.
Sadly the job linked in the observation was already deleted when I picked up the ticket, so I/we assumed this ticket must be about the fatal IO errors and orphan-process cleanups showing up on grenache tests, leading to a long cleanup time and timeouts in certain cases (example). However, looking at the above log, the errors we have been looking at MAY actually be a separate issue.
For now I have created two follow-up issues, one regarding the fatal IO errors and orphaned processes, and another about the "... Waiting for 1 jobs to finish" message which can be seen in the log linked above.
@waynechen55 As Oli has mentioned, for now please consider lowering your timeout values and be aware of the impact of TIMEOUT_SCALE=3.
Updated by waynechen55 9 days ago
robert.richardson wrote in #note-23:
As mentioned in the daily, after seeing this test's autoinst.log I think that we might have been on the wrong track, as it actually fits the description of the ticket a lot better.
Sadly the job linked in the observation was already deleted when I picked up the ticket, so I/we assumed this ticket must be about the fatal IO errors and orphan-process cleanups showing up on grenache tests, leading to a long cleanup time and timeouts in certain cases (example). However, looking at the above log, the errors we have been looking at MAY actually be a separate issue.
For now I have created two follow-up issues, one regarding the fatal IO errors and orphaned processes, and another about the "... Waiting for 1 jobs to finish" message which can be seen in the log linked above.
@waynechen55 As Oli has mentioned, for now please consider lowering your timeout values and be aware of the impact of TIMEOUT_SCALE=3.
Lowering MAX_JOB_TIME and TIMEOUT_SCALE can just work around the issue temporarily???