action #180110
coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"
[sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines
Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description
Observation
This error message is caused by leftover QEMU processes. This ticket is a continuation of ticket #170209. As part of that ticket we:
- Changed RWP to make sure the whole process group is terminated even if the initial isotovideo process isn't running anymore: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/ba7bb383a02c44a3d6340a900fbd8d179942c449 (and the fixup https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/c4c79303145fce0880b9f1697a782840085d3c16)
- This should remove one source of leftover QEMU processes.
- Double-checked the worker self-checks that are supposed to prevent jobs from being assigned to workers when there are leftover QEMU processes, see #170209#note-47.
- It is still not clear why those self-checks don't work in production.
- Established that the problem still sometimes happens despite these efforts, see #170209#note-44.
Acceptance criteria
- AC1: We know why there are sometimes still leftover QEMU processes and RWP is able to terminate them as far as possible.
- AC2: The worker does not run further openQA jobs if there are leftover QEMU processes so we don't end up with incomplete jobs in case a process is stuck for good (and instead an alert fires due to the broken/unavailable worker so we can take care of the situation manually).
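The behaviour asked for in AC2 could look roughly like the following Python sketch. It is not the openQA worker's actual self-check; the PID file name `qemu.pid` and both function names are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def leftover_pid(pool_dir: str, pid_file_name: str = "qemu.pid"):
    """Return the PID recorded in the pool directory if that process is alive, else None.

    The file name is a hypothetical default for this sketch.
    """
    pid_file = Path(pool_dir) / pid_file_name
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file or unreadable content: nothing to report
    if pid <= 0:
        return None  # not a valid PID; os.kill would address a group instead
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return None  # stale PID file, the process is gone
    except PermissionError:
        pass  # the process exists but belongs to another user
    return pid


def may_accept_jobs(pool_dir: str) -> bool:
    """A worker slot should refuse jobs while a leftover process is alive."""
    return leftover_pid(pool_dir) is None
```

A production check would also need to confirm that the PID still belongs to a QEMU process, since PIDs can be reused by unrelated processes.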
Suggestions
- Maybe there are more improvements to make in RWP, e.g. fixing some race condition.
- Note that https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/64 is also still pending, but that PR will not help with this specific issue.
- There must be something wrong with the self-check. Maybe implementing a fullstack test for that feature would help figure out what. Spawning multiple worker instances locally that use the same pool directory (and hence conflict with each other) may also help reproduce this issue.
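For local experiments, the symptom itself ("Address already in use") can be imitated without QEMU. The following Python sketch shows one process-local way to do that: a listening socket stands in for the leftover process that still holds a port. The helper name `try_bind` is an assumption made for this example.

```python
import errno
import socket


def try_bind(port: int, host: str = "127.0.0.1"):
    """Try to bind a listening TCP socket; return it, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError as err:
        sock.close()
        if err.errno == errno.EADDRINUSE:
            return None  # another socket still holds this port
        raise
    return sock


# Port 0 lets the kernel pick a free port; this socket plays the leftover process.
holder = try_bind(0)
port = holder.getsockname()[1]

# A second bind attempt on the same port fails while the holder is alive.
conflict = try_bind(port)

# Once the holder is gone, the port becomes available again.
holder.close()
retry = try_bind(port)
if retry is not None:
    retry.close()
```

Here `conflict` is `None` as long as `holder` is open, which mirrors a new job failing because an earlier QEMU process was never terminated.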
Updated by mkittler about 1 month ago
- Due date set to 2024-11-26
- Start date changed from 2025-04-07 to 2024-11-26
- Follows action #170209: [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M added
Updated by okurz about 1 month ago
- Due date deleted (2024-11-26)
- Category set to Regressions/Crashes
- Target version set to Tools - Next
- Start date deleted (2024-11-26)
- Parent task set to #175515
Updated by okurz about 1 month ago
- Copied to action #180116: Do not run openQA jobs if there are leftover QEMU processes added
Updated by mkittler about 1 month ago
- Has duplicate action #180641: [sporadic] Tests fail with auto_review:"hostfwd.*Could not set up host forwarding rule":retry added