action #163772
closed
[openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run size:M
Description
Observation
Jobs assigned to worker35:x do not actually run. They all hang in the running state: there is no output in the live view or live log, and no test modules are loaded.
Job 14896205, assigned to worker35:50, is in running state and hangs
Job 14896209, assigned to worker35:51, is in running state and hangs
Job 14896235, assigned to worker35:49, is in running state and hangs
Failures with reason isotovideo died: Unable to clone Git repository 'https://github.com/waynechen55/os-autoinst-distri-opensuse.git#wayne/enable_kernel_log' specified via CASEDIR (see log for details) at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 164.
look like so:
[2024-07-11T16:35:17.821676+02:00] [info] [pid:58222] ::: OpenQA::Isotovideo::Utils::clone_git: Cloning git URL 'https://github.com/waynechen55/os-autoinst-distri-opensuse.git' into '/var/lib/openqa/pool/49'
[2024-07-11T16:35:17.821776+02:00] [info] [pid:58222] ::: OpenQA::Isotovideo::Utils::clone_git: Checking out git refspec/branch 'wayne/enable_kernel_log'
[2024-07-11T16:36:00.214318+02:00] [debug] [pid:58222] Cloning into 'os-autoinst-distri-opensuse'...
error: RPC failed; curl 18 HTTP/2 stream 5 was not closed cleanly before end of the underlying connection
error: 1732 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
Or they fail with reason Reason: backend died: ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open Session Response
look like so:
[2024-07-11T16:41:46.020309+02:00] [debug] [pid:63553] Launching external video encoder: ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1 'video.webm'
[2024-07-11T16:41:50.111055+02:00] [info] [pid:63553] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open Session Response
Error: Received an Unexpected Open Session Response
[...]
Error: Received an Unexpected Open Session Response
Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
[2024-07-11T16:41:50.111690+02:00] [debug] [pid:63553] Passing remaining frames to the video encoder
[image2pipe @ 0x5568ecfe1480] Could not find codec parameters for stream 0 (Video: ppm, none): unspecified size
Consider increasing the value for the 'analyzeduration' (0) and 'probesize' (5000000) options
Steps to reproduce
- Trigger ipmi backend job
- Job assigned to worker35:x
Impact
Jobs cannot be run efficiently or effectively
Problem
The problem looks related to the worker process
Acceptance criteria
- AC1: Worker processes jobs successfully again
Suggestions
- Check relevant worker process
- Check worker settings
- Check other related processes
- Confirm if this is an issue with the job/worker setup, or there is an underlying bug e.g. assets being slow to download, git sources being slow
- Confirm if this is one issue or two separate issues and file follow-up tickets as needed
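To help decide whether this is one issue or two, the incomplete jobs could be grouped by failure reason. The following is only a minimal sketch: the function name `classify_reason` and the category labels are made up for illustration, and the matched substrings are taken from the two reasons quoted in the observation above. How the reason strings are collected from openQA is left out.

```python
from collections import Counter

def classify_reason(reason: str) -> str:
    """Map a job failure reason to a coarse category (hypothetical labels)."""
    if "Unable to clone Git repository" in reason:
        return "git-clone"
    if ("Received an Unexpected Open Session Response" in reason
            or "Unable to establish IPMI" in reason):
        return "ipmi-session"
    return "other"

# Reason strings shortened from the ones quoted in this ticket.
reasons = [
    "isotovideo died: Unable to clone Git repository "
    "'https://github.com/waynechen55/os-autoinst-distri-opensuse.git"
    "#wayne/enable_kernel_log' specified via CASEDIR",
    "backend died: ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org "
    "-U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open Session Response",
]

# Count how many jobs fall into each category.
counts = Counter(classify_reason(r) for r in reasons)
print(dict(counts))
```

If both categories show up on the same worker in the same time window, that would hint at a shared cause (e.g. network load) rather than two unrelated bugs, but that conclusion is an assumption to verify, not something the grouping proves.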
Workaround
n/a
Updated by waynechen55 5 months ago
- Subject changed from [openQA][worker] Assignded jobs hang and actually can not run to [openQA][worker35:x] Assignded jobs hang and actually can not run
Updated by okurz 5 months ago
- Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Updated by waynechen55 5 months ago
- Subject changed from [openQA][worker35:x] Assigned jobs hang and actually can not run to [openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run
Updated by mkittler 5 months ago · Edited
Considering the jobs ended up as incomplete (Reason: backend died: ipmitool -I lanplus -H quinn-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc selftest: Error: Received an Unexpected Open Session Response) I don't think the worker or the web UI is to blame.
I had a closer look at https://openqa.suse.de/tests/14896205. It took over 15 minutes to sync assets. Then it took 6 minutes to sync tests via the cache service. Because CASEDIR was set to a Git URL it took another 6 minutes to clone tests via Git. In total it took from [2024-07-11T16:11:25.079497+02:00] to [2024-07-11T16:41:35.346137+02:00] until the backend initialization was started. That is of course very long. However, I don't think that any of the components actually misbehaved.
I'm not sure whether the live log is currently capable of showing frequent updates in these stages of the test setup.
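For reference, the total setup time between the two timestamps quoted above can be checked with a few lines of Python. This is just an illustrative calculation; the helper name `elapsed_seconds` is made up here and is not part of openQA.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two ISO 8601 timestamps as found in the job log."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()

# Timestamps copied from the log of job 14896205.
start = "2024-07-11T16:11:25.079497+02:00"
end = "2024-07-11T16:41:35.346137+02:00"

total = elapsed_seconds(start, end)
minutes, seconds = divmod(total, 60)
print(f"{int(minutes)} min {seconds:.1f} s")  # prints "30 min 10.3 s"
```

So roughly 30 minutes passed before backend initialization started, which matches the stages listed above (over 15 minutes for assets, 6 minutes for the cache service sync, 6 minutes for the Git clone) plus some remaining overhead.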
Updated by okurz 5 months ago
- Status changed from Workable to Resolved
- Assignee set to okurz
Meanwhile, more recent jobs ran successfully, e.g. https://openqa.suse.de/tests/14896240. All jobs were eventually completed, and the slow startup and missing output are symptoms of the mitigations we put in place temporarily.