action #163772

closed

[openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run size:M

Added by waynechen55 16 days ago. Updated 7 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Support
Target version:
Start date:
2024-07-11
Due date:
% Done:

0%

Estimated time:

Description

Observation

Jobs assigned to worker35:x do not actually run. They hang in the running state: the live view and live log show no output, and no test modules are loaded.
no_loaded_test_modules
empty_live_view
empyt_live_log

Job 14896205, assigned to worker35:50, hangs in the running state
Job 14896209, assigned to worker35:51, hangs in the running state
Job 14896235, assigned to worker35:49, hangs in the running state

Some jobs fail with the reason isotovideo died: Unable to clone Git repository 'https://github.com/waynechen55/os-autoinst-distri-opensuse.git#wayne/enable_kernel_log' specified via CASEDIR (see log for details) at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 164. The log looks like this:

[2024-07-11T16:35:17.821676+02:00] [info] [pid:58222] ::: OpenQA::Isotovideo::Utils::clone_git: Cloning git URL 'https://github.com/waynechen55/os-autoinst-distri-opensuse.git' into '/var/lib/openqa/pool/49'
[2024-07-11T16:35:17.821776+02:00] [info] [pid:58222] ::: OpenQA::Isotovideo::Utils::clone_git: Checking out git refspec/branch 'wayne/enable_kernel_log'
[2024-07-11T16:36:00.214318+02:00] [debug] [pid:58222] Cloning into 'os-autoinst-distri-opensuse'...
error: RPC failed; curl 18 HTTP/2 stream 5 was not closed cleanly before end of the underlying connection
error: 1732 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
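
When log text is copied without its line breaks, several timestamped entries can run together on one line. A minimal sketch for splitting such text back into one item per entry, assuming every entry starts with a bracketed ISO 8601 timestamp as in the excerpt above; split_entries and ENTRY_START are hypothetical helper names, not part of openQA or os-autoinst:

```python
import re

# Zero-width match just before a bracketed timestamp such as
# "[2024-07-11T16:35:17.821676+02:00] " (assumed entry prefix).
ENTRY_START = re.compile(
    r"(?=\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}\] )"
)


def split_entries(text):
    """Return one string per timestamped entry.

    Lines without a timestamp (e.g. git error output) stay attached to
    the entry that precedes them.
    """
    return [part.strip() for part in ENTRY_START.split(text) if part.strip()]


fused = (
    "[2024-07-11T16:35:17.821676+02:00] [info] [pid:58222] first message"
    "[2024-07-11T16:35:17.821776+02:00] [info] [pid:58222] second message"
    "[2024-07-11T16:36:00.214318+02:00] [debug] [pid:58222] Cloning into "
    "'os-autoinst-distri-opensuse'...\nerror: RPC failed"
)
entries = split_entries(fused)
for entry in entries:
    print(entry)
    print("---")
```

Splitting on a lookahead keeps each timestamp with its own entry; this relies on Python 3.7 or later, where re.split accepts patterns that match the empty string.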

Others fail with the reason backend died: ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open Session Response. The log looks like this:

[2024-07-11T16:41:46.020309+02:00] [debug] [pid:63553] Launching external video encoder: ffmpeg -y -hide_banner -nostats -r 24 -f image2pipe -vcodec ppm -i - -pix_fmt yuv420p -c:v libvpx-vp9 -crf 35 -b:v 1500k -cpu-used 1 'video.webm'
[2024-07-11T16:41:50.111055+02:00] [info] [pid:63553] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open Session Response
Error: Received an Unexpected Open Session Response
[...]
Error: Received an Unexpected Open Session Response
Error: Unable to establish IPMI v2 / RMCP+ session at /usr/lib/os-autoinst/backend/ipmi.pm line 45.
[2024-07-11T16:41:50.111690+02:00] [debug] [pid:63553] Passing remaining frames to the video encoder
[image2pipe @ 0x5568ecfe1480] Could not find codec parameters for stream 0 (Video: ppm, none): unspecified size
Consider increasing the value for the 'analyzeduration' (0) and 'probesize' (5000000) options

Steps to reproduce

  • Trigger ipmi backend job
  • Job assigned to worker35:x

Impact

Jobs cannot be run efficiently or effectively.

Problem

The problem appears to be related to the worker process.

Acceptance criteria

  • AC1: Worker processes jobs successfully again

Suggestions

  • Check relevant worker process
  • Check worker settings
  • Check other related processes
  • Confirm whether this is an issue with the job/worker setup or an underlying bug, e.g. assets being slow to download or git sources being slow
  • Confirm if this is one issue or two separate issues and file follow-up tickets as needed
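
One way to approach the last suggestion is to group incomplete jobs by their failure reason. A minimal sketch, assuming only the two reason strings quoted in the description; classify_reason and its category labels are hypothetical and not part of openQA:

```python
from collections import Counter


def classify_reason(reason):
    """Map a job failure reason to a coarse category (hypothetical labels)."""
    if "Unable to clone Git repository" in reason:
        return "git-clone"
    if ("Received an Unexpected Open Session Response" in reason
            or "Unable to establish IPMI" in reason):
        return "ipmi-session"
    return "other"


# Reason strings shortened from the ones quoted in the description.
reasons = [
    "isotovideo died: Unable to clone Git repository "
    "'https://github.com/waynechen55/os-autoinst-distri-opensuse.git"
    "#wayne/enable_kernel_log' specified via CASEDIR",
    "backend died: ipmitool -I lanplus -H fibonacci-ipmi.qe.prg2.suse.org "
    "-U ADMIN -P [masked] mc guid: Error: Received an Unexpected Open "
    "Session Response",
]
counts = Counter(classify_reason(r) for r in reasons)
print(counts)
```

If both categories show up for the same worker in the same time window, that hints at two separate causes (Git access and IPMI session setup) rather than one, which would justify separate follow-up tickets.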

Workaround

n/a


Files

assigned_jobs_hang.png (45.5 KB) waynechen55, 2024-07-11 14:28
assigned_jobs_hang_02.png (43.6 KB) waynechen55, 2024-07-11 14:28
assigned_jobs_hang_03.png (7.56 KB) waynechen55, 2024-07-11 14:28

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M, Resolved, okurz, 2024-07-10

Actions #1

Updated by waynechen55 16 days ago

  • Subject changed from [openQA][worker] Assignded jobs hang and actually can not run to [openQA][worker35:x] Assignded jobs hang and actually can not run
Actions #2

Updated by okurz 16 days ago

  • Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #3

Updated by okurz 16 days ago

  • Subject changed from [openQA][worker35:x] Assignded jobs hang and actually can not run to [openQA][worker35:x] Assigned jobs hang and actually can not run
  • Category set to Support
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #4

Updated by waynechen55 15 days ago

  • Subject changed from [openQA][worker35:x] Assigned jobs hang and actually can not run to [openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run
Actions #5

Updated by mkittler 12 days ago · Edited

Considering that the jobs ended up as incomplete (Reason: backend died: ipmitool -I lanplus -H quinn-ipmi.qe.prg2.suse.org -U ADMIN -P [masked] mc selftest: Error: Received an Unexpected Open Session Response), I don't think the worker or the web UI is to blame.

I had a closer look at https://openqa.suse.de/tests/14896205. It took over 15 minutes to sync assets. Then it took 6 minutes to sync tests via the cache service. Because CASEDIR was set to a Git URL it took another 6 minutes to clone tests via Git. In total it took from [2024-07-11T16:11:25.079497+02:00] to [2024-07-11T16:41:35.346137+02:00] until the backend initialization was started. That is of course very long. However, I don't think that any of the components actually misbehaved.
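
The total setup time can be checked from the two timestamps quoted above. A minimal sketch using only the Python standard library; the helper name elapsed is hypothetical:

```python
from datetime import datetime


def elapsed(start, end):
    """Return the timedelta between two ISO 8601 timestamps."""
    return datetime.fromisoformat(end) - datetime.fromisoformat(start)


# Timestamps quoted in the comment above for job 14896205.
setup = elapsed(
    "2024-07-11T16:11:25.079497+02:00",
    "2024-07-11T16:41:35.346137+02:00",
)
minutes, seconds = divmod(int(setup.total_seconds()), 60)
print(f"setup took {minutes} min {seconds} s")
```

This gives a bit over 30 minutes, of which roughly 27 are covered by the three stages named above (asset sync, test sync via the cache service, Git clone).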

I'm not sure whether the live log is currently capable of showing frequent updates in these stages of the test setup.

Actions #6

Updated by livdywan 9 days ago

  • Subject changed from [openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run to [openQA][ipmi][worker35:x] Assigned jobs hang and actually can not run size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by okurz 7 days ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz

Meanwhile, more recent jobs ran successfully, e.g. https://openqa.suse.de/tests/14896240. All jobs eventually completed; the slow startup and missing output are symptoms of the mitigations we temporarily put in place.
