action #170209

Updated by gpuliti about 1 month ago

# Observation 

 I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418 
 It fails with: 

 ``` 
 [2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an availabale port: Address already in use 
 ``` 

 I asked in [Slack](https://suse.slack.com/archives/C02AJ1E568M/p1732530369296399), and @tinita had been seeing the same for "a few hours", apparently all on worker39.

 https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445 
 The `qemu-system-x86_64` process 1963 has been running for 12 hours with `-vnc :102,share=force-shared`, so it still holds the VNC port that every later job on that worker slot tries to use.
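
 To confirm this kind of conflict by hand, a small probe like the following might help. It is only a sketch: it relies on the usual convention that VNC display `:N` listens on TCP port 5900 + N (so `:102` means port 6002), and the helper names `vnc_port` and `port_in_use` are made up for illustration, they are not part of openQA or os-autoinst.

 ```python
import errno
import socket

VNC_BASE_PORT = 5900  # VNC display :N conventionally maps to TCP port 5900 + N


def vnc_port(display: int) -> int:
    """Return the TCP port for a VNC display number, e.g. 102 -> 6002."""
    return VNC_BASE_PORT + display


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind the port; EADDRINUSE means another process still holds it.

    Hypothetical helper: it mimics what a new QEMU would attempt, it does not
    tell us *which* process owns the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other error is unexpected, do not hide it
    return False
 ```

 On the affected worker, `port_in_use(vnc_port(102))` would be expected to return `True` for as long as the leftover QEMU process is alive.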

 ## Acceptance criteria 
 * **AC1:** Affected jobs are restarted automatically 
 * **AC2:** We have a better understanding of situations where this can happen (if at all) 

 # Suggestion 
 * Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes 
 * Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess) 
 * Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level) 
 * Consider starting/stopping isotovideo in a process group with low-level Perl code, both to replicate and investigate the error and to potentially replace the problematic Mojo::ReadWriteProcess
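
 As a reference for what "terminating the process group" means at the POSIX level, here is a rough sketch. It is written in Python rather than the Perl suggested above, it is not how os-autoinst or Mojo::ReadWriteProcess actually do it, and the names `start_in_new_group`, `group_alive` and `stop_group` are hypothetical. The idea is to make the child the leader of a new session/process group, signal the whole group, and escalate from SIGTERM to SIGKILL if something survives.

 ```python
import os
import signal
import subprocess
import time


def start_in_new_group(argv):
    """Start argv as the leader of a new session and process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so its PID is also its process group ID.
    return subprocess.Popen(argv, start_new_session=True)


def group_alive(pgid):
    """Return True if at least one process in the group still exists.

    Note: a zombie that has not been reaped yet still counts as existing.
    """
    try:
        os.killpg(pgid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # something is there, but it is not ours to signal
    return True


def stop_group(proc, grace=5.0):
    """SIGTERM the whole group, then SIGKILL if it outlives `grace` seconds.

    Returns True once no process is left in the group.
    """
    pgid = proc.pid
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            break  # group already gone
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            proc.poll()  # reap the leader so it does not linger as a zombie
            if not group_alive(pgid):
                return True
            time.sleep(0.05)
    proc.poll()
    return not group_alive(pgid)
 ```

 A `False` return from `stop_group` would be the interesting case for this ticket: something in the group (for example a QEMU process) survived even SIGKILL handling or was never in the group to begin with, which is worth logging loudly.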
