action #180110: [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines - openQA Project (public) - openSUSE Project Management Tool

action #180110

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"

Status: New
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done:
Estimated time:

Description

This error message is caused by leftover QEMU processes. This ticket is a continuation of ticket #170209. As part of that ticket we:

- Changed RWP (Mojo-IOLoop-ReadWriteProcess) so that the whole process group is terminated even if the initial isotovideo process is no longer running: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/ba7bb383a02c44a3d6340a900fbd8d179942c449 (and the fixup https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/c4c79303145fce0880b9f1697a782840085d3c16)
  - This should remove one source of leftover QEMU processes.
- Double-checked the worker self-checks that are supposed to prevent jobs from being assigned to workers while leftover QEMU processes exist, see #170209#note-47.
  - It is still not clear why those self-checks don't work in production.
- Established that the problem still sometimes happens despite these efforts, see #170209#note-44.
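
For illustration, the idea of terminating a whole process group even after its leader has exited can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the RWP (Perl) implementation: the helper names `live_group_members` and `terminate_group` are hypothetical, the grace period is arbitrary, and the `/proc` scan assumes Linux.

```python
import os
import signal
import time


def live_group_members(pgid):
    """Return PIDs of non-zombie processes in process group `pgid` (Linux /proc)."""
    members = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as fh:
                # Fields after the command name: state, ppid, pgrp, ...
                fields = fh.read().rsplit(")", 1)[1].split()
        except (OSError, IndexError):
            continue  # process vanished while we were scanning
        if fields[0] != "Z" and int(fields[2]) == pgid:
            members.append(int(entry))
    return members


def terminate_group(pgid, grace=2.0):
    """SIGTERM the whole group, escalate to SIGKILL, report whether it is gone.

    Hypothetical helper: it addresses the group ID rather than the leader PID,
    so it still reaches group members after the leader itself has exited.
    """
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            return True  # nothing left in the group
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            if not live_group_members(pgid):
                return True
            time.sleep(0.05)
    return not live_group_members(pgid)
```

A caller would record the group ID when starting the child in its own session or process group and pass that ID to `terminate_group` later, regardless of whether the original child is still alive. A process stuck in uninterruptible sleep would survive even SIGKILL, in which case the function returns False.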

AC1: We know why there are sometimes still leftover QEMU processes and RWP is able to terminate them as far as possible.
AC2: The worker does not run further openQA jobs while there are leftover QEMU processes, so a process that is stuck for good does not lead to incomplete jobs (instead, an alert fires for the broken/unavailable worker so we can handle the situation manually).
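
The behaviour AC2 asks for can be sketched roughly as below. This is a hedged Python illustration only, not the actual openQA worker self-check: the PID file name `qemu.pid`, the process-name prefix heuristic, the state names and the helpers `find_leftover` and `worker_state` are all assumptions made for the example, and the `/proc` lookup assumes Linux.

```python
import os


def find_leftover(pool_dir, pid_file="qemu.pid", comm_prefix="qemu"):
    """Return the PID recorded in the pool directory if that process still exists.

    Hypothetical check: the file name and the process-name prefix are assumptions.
    """
    try:
        with open(os.path.join(pool_dir, pid_file)) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return None  # no PID file or unreadable content: nothing to report
    try:
        with open(f"/proc/{pid}/comm") as fh:
            comm = fh.read().strip()
    except OSError:
        return None  # the recorded process is gone
    # Guard against PID reuse by an unrelated process.
    return pid if comm.startswith(comm_prefix) else None


def worker_state(pool_dir, **kwargs):
    """Report 'broken' instead of 'idle' while a leftover process exists."""
    return "broken" if find_leftover(pool_dir, **kwargs) is not None else "idle"
```

Note that a check of this shape only sees a process that is still recorded in the PID file; a leftover process whose PID file has been removed or overwritten would go unnoticed by this sketch.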

- Maybe there are more improvements to make in RWP, e.g. fixing a race condition.
  - Note that https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/64 is also still pending, but that PR will not help with this concrete issue.
- There must be something wrong with the self-check. Implementing a fullstack test for that feature might help figure out what. Spawning multiple worker instances locally that use the same pool directory (and hence conflict with each other) might also help reproduce this issue.
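
The symptom itself can be reproduced in isolation: as long as one process still holds a listening socket, a second attempt to listen on the same address fails with "Address already in use" (`EADDRINUSE`). The following is a minimal Python sketch of that mechanism only; `try_listen` is a hypothetical helper and the code does not reflect how openQA or QEMU select their ports.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Return a listening socket on host:port, or None if the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as err:
        sock.close()
        if err.errno == errno.EADDRINUSE:
            return None  # another socket (e.g. a leftover process) holds the port
        raise
```

In a local reproduction, the first listener stands in for the leftover process and the second call for the newly started job; the port only becomes available again once the first socket is closed, i.e. once the leftover process is really gone.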

Related issues 3 (1 open — 2 closed)

Due date set to 2024-11-26
Start date changed from 2025-04-07 to 2024-11-26
Follows action #170209: [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M added