action #175464
coordination #102906 (closed): [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"
jobs incomplete with auto_review:"setup failure: isotovideo can not be started"
Added by rainerkoenig 16 days ago. Updated 15 days ago.
Description
Observation
Jobs are failing with setup failure: isotovideo can not be started
on OSD e.g. https://openqa.suse.de/tests/16454025 after rsync is done.
Same symptoms on o3 e.g. https://openqa.opensuse.org/tests/4774219 after assets synced successfully.
https://openqa.opensuse.org/tests/4774741 and many others on o3 incomplete with "setup failure: isotovideo can not be started". Neither autoinst-log.txt nor worker-log.txt contains further details that seem related.
Possible regression since 2025-01-13; from /var/log/zypp/history on w26:
2025-01-13 09:49:42|install|os-autoinst|4.6.1736759998.a4f72cc-lp156.2000.1|x86_64||devel_openQA|5eea1b72f67fa131e63ac48ecaff2191352a5b3acfe6faf02c82013148ef164f|
2025-01-14 16:17:37|install|os-autoinst|4.6.1736869520.3d40ba7-lp156.2001.1|x86_64||devel_openQA|aa6827c564e410e9893e2bff91270229378cf6d1f882520923782ded1ede257b|
The only commit included is "855fbb60 (okurz/fix/diag_explain) t: Fix hidden output of 'diag explain'", which only touches files under "t/".
Maybe openQA-worker? Cross-checking with /var/log/zypp/history, the last good version from https://openqa.opensuse.org/tests/4769715 seems to be "2025-01-13 15:59:48|install|openQA-worker|4.6.1736782755.0771fde7-lp156.7510.1|x86_64||devel_openQA|e0194139cae23cfdcb5b869bb92962b3106caf041cba9585752756416647cc30|". diff:
6f71c5c20 (okurz/feature/openqa_gru, feature/openqa_gru) systemd: Fix premature kill of openqa-gru background processes
5742c2302 Bump eslint-config-prettier from 9.1.0 to 10.0.1
03a921907 Avoid calling `is_running` unnecessarily in `kill`
Maybe ReadWriteProcess?
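To enumerate candidate packages in the suspected window, a small sketch (not a command from the ticket; the history file is pipe-separated as in the lines above):
# list date, package name and version of everything installed in the window
awk -F'|' '$2 == "install" && $1 >= "2025-01-13" && $1 < "2025-01-15 12:00" {print $1, $3, $4}' /var/log/zypp/history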
Suggestions
- Identify recent changes relevant to the failure and mitigate as a first step
- These are seen at least on grenache and worker24
Rollback steps
- DONE Re-enable openQA-auto-update
Updated by rainerkoenig 16 days ago
- Subject changed from "setup failure: isotovideo can not be started" to "Setup failure: isotovideo can not be started on OSD and O3"
Currently experiencing massive problems on OSD and O3 because tests report
setup failure: isotovideo can not be started
Example for OSD: https://openqa.suse.de/tests/16454025
Example for O3: https://openqa.opensuse.org/tests/4774219
Just look at the overview on O3 to see how many tests have that dark red dot.
Updated by livdywan 16 days ago
- Related to action #175470: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Updated by okurz 16 days ago
- Has duplicate action #175470: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Updated by okurz 16 days ago · Edited
On openqaworker26
zypper se --details perl-Mojo-IOLoop-ReadWriteProcess
…
zypper in --oldpackage perl-Mojo-IOLoop-ReadWriteProcess-0.340.0-bp156.1.1
Triggered a test with
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4774948 WORKER_CLASS=openqaworker26 _GROUP=0 BUILD= TEST+=-okurz
which incompleted the same as before: https://openqa.opensuse.org/tests/4774963
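For readers unfamiliar with the settings, the same command annotated (standard openqa-clone-job semantics):
# WORKER_CLASS=openqaworker26  pin the clone to the worker under investigation
# _GROUP=0                     detach the clone from any job group
# BUILD=                       clear BUILD so the clone does not affect build statistics
# TEST+=-okurz                 append a suffix to the test name to mark the manual clone
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4774948 \
    WORKER_CLASS=openqaworker26 _GROUP=0 BUILD= TEST+=-okurz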
Updated by okurz 16 days ago
- Status changed from New to In Progress
- Assignee set to okurz
openqaworker26:/tmp # snapper ls
Failed to set locale. Fix your system.
# | Type | Pre # | Date | User | Used Space | Cleanup | Description | Userdata
------+--------+-------+--------------------------+------+------------+---------+-----------------------+--------------
0 | single | | | root | | | current |
1* | single | | Fri Aug 4 15:09:15 2023 | root | 2.97 MiB | | first root filesystem |
6166 | pre | | Fri Dec 6 17:35:35 2024 | root | 742.77 MiB | number | zypp(zypper) | important=yes
6167 | post | 6166 | Fri Dec 6 17:36:57 2024 | root | 194.69 MiB | number | | important=yes
6274 | pre | | Fri Dec 13 16:54:45 2024 | root | 189.26 MiB | number | zypp(zypper) | important=yes
6275 | post | 6274 | Fri Dec 13 16:55:34 2024 | root | 1.58 MiB | number | | important=yes
6276 | pre | | Sat Dec 14 03:33:06 2024 | root | 944.00 KiB | number | zypp(zypper) | important=yes
6277 | post | 6276 | Sat Dec 14 03:33:39 2024 | root | 136.96 MiB | number | | important=yes
6316 | pre | | Tue Dec 17 09:32:26 2024 | root | 234.55 MiB | number | zypp(zypper) | important=yes
6317 | post | 6316 | Tue Dec 17 09:33:11 2024 | root | 218.92 MiB | number | | important=yes
6502 | pre | | Fri Jan 3 14:30:06 2025 | root | 266.17 MiB | number | zypp(zypper) | important=yes
6503 | post | 6502 | Fri Jan 3 14:30:54 2025 | root | 213.94 MiB | number | | important=yes
6642 | pre | | Tue Jan 14 12:33:31 2025 | root | 261.70 MiB | number | zypp(zypper) | important=no
6643 | post | 6642 | Tue Jan 14 12:33:46 2025 | root | 368.00 KiB | number | | important=no
6644 | pre | | Tue Jan 14 14:33:18 2025 | root | 352.00 KiB | number | zypp(zypper) | important=no
6645 | post | 6644 | Tue Jan 14 14:33:37 2025 | root | 416.00 KiB | number | | important=no
6646 | pre | | Tue Jan 14 16:17:31 2025 | root | 480.00 KiB | number | zypp(zypper) | important=no
6647 | post | 6646 | Tue Jan 14 16:17:50 2025 | root | 512.00 KiB | number | | important=no
6648 | pre | | Tue Jan 14 17:53:36 2025 | root | 416.00 KiB | number | zypp(zypper) | important=no
6649 | post | 6648 | Tue Jan 14 17:53:50 2025 | root | 384.00 KiB | number | | important=no
6650 | pre | | Tue Jan 14 19:17:31 2025 | root | 336.00 KiB | number | zypp(zypper) | important=no
6651 | post | 6650 | Tue Jan 14 19:17:38 2025 | root | 448.00 KiB | number | | important=no
6652 | pre | | Wed Jan 15 04:24:30 2025 | root | 320.00 KiB | number | zypp(zypper) | important=no
6653 | post | 6652 | Wed Jan 15 04:24:37 2025 | root | 304.00 KiB | number | | important=no
6654 | pre | | Wed Jan 15 04:36:31 2025 | root | 304.00 KiB | number | zypp(zypper) | important=no
6655 | post | 6654 | Wed Jan 15 04:36:37 2025 | root | 304.00 KiB | number | | important=no
6656 | pre | | Wed Jan 15 07:12:31 2025 | root | 304.00 KiB | number | zypp(zypper) | important=no
6657 | post | 6656 | Wed Jan 15 07:12:38 2025 | root | 304.00 KiB | number | | important=no
6658 | pre | | Wed Jan 15 07:54:31 2025 | root | 304.00 KiB | number | zypp(zypper) | important=no
6659 | post | 6658 | Wed Jan 15 07:54:37 2025 | root | 496.00 KiB | number | | important=no
…
openqaworker26:/tmp # snapper rollback 6503
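For context, what the rollback does on a btrfs/snapper root (a sketch of the usual flow; the verification steps are assumptions, not from the ticket):
snapper rollback 6503   # create a read-write snapshot based on #6503 and make it the new default subvolume
snapper ls | tail -n 3  # the freshly created rollback snapshots appear at the end of the list
systemctl reboot        # the next boot starts from the rolled-back root filesystem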
Updated by okurz 16 days ago · Edited
Triggered the rollback on OSD from https://gitlab.suse.de/openqa/osd-deployment/-/jobs/3659170
A w26 job after the snapper rollback seems to be fine: https://openqa.opensuse.org/tests/4774968#live
Updated by okurz 16 days ago
I assume the faulty packages are part of:
2025-01-13 21:10:37|install|os-autoinst-distri-opensuse-deps|1.1736801914.80d1fed9-lp156.16130.1|
2025-01-14 09:51:04|install|os-autoinst-distri-opensuse-deps|1.1736847260.51421743-lp156.16131.1|
2025-01-14 10:37:42|install|openQA-common|4.6.1736849661.1b1680ba-lp156.7511.1
2025-01-14 10:37:42|install|openQA-client|4.6.1736849661.1b1680ba-lp156.7511.1
2025-01-14 10:37:45|install|openQA-worker|4.6.1736849661.1b1680ba-lp156.7511.1
2025-01-14 10:37:47|install|openQA-continuous-update|4.6.1736849661.1b1680ba-lp156.7511.1
2025-01-14 10:37:49|install|openQA-auto-update|4.6.1736849661.1b1680ba-lp156.7511.1
2025-01-14 11:40:36|install|os-autoinst-distri-opensuse-deps|1.1736851565.682d637b-lp156.16133.1|
2025-01-14 11:40:36|install|perl-Mojo-IOLoop-ReadWriteProcess|0.340.0-lp156.45.1|
2025-01-14 12:33:37|install|openQA-common|4.6.1736849661.1b1680ba-lp156.7512.1
2025-01-14 12:33:38|install|openQA-continuous-update|4.6.1736849661.1b1680ba-lp156.7512.1
2025-01-14 12:33:38|install|openQA-client|4.6.1736849661.1b1680ba-lp156.7512.1
2025-01-14 12:33:40|install|openQA-auto-update|4.6.1736849661.1b1680ba-lp156.7512.1
2025-01-14 12:33:45|install|openQA-worker|4.6.1736849661.1b1680ba-lp156.7512.1
2025-01-14 14:33:24|install|openQA-common|4.6.1736863711.05c983e0-lp156.7514.1
2025-01-14 14:33:24|install|os-autoinst-distri-opensuse-deps|1.1736864399.8b8e2d18-lp156.16136.1|
2025-01-14 14:33:25|install|openQA-client|4.6.1736863711.05c983e0-lp156.7514.1
2025-01-14 14:33:32|install|openQA-worker|4.6.1736863711.05c983e0-lp156.7514.1
2025-01-14 14:33:33|install|openQA-continuous-update|4.6.1736863711.05c983e0-lp156.7514.1
2025-01-14 14:33:35|install|openQA-auto-update|4.6.1736863711.05c983e0-lp156.7514.1
2025-01-14 16:17:37|install|os-autoinst|4.6.1736869520.3d40ba7-lp156.2001.1
2025-01-14 16:17:39|install|os-autoinst-openvswitch|4.6.1736869520.3d40ba7-lp156.2001.1
2025-01-14 16:17:39|install|openQA-common|4.6.1736869444.81f8cc9f-lp156.7515.1
2025-01-14 16:17:39|install|openQA-client|4.6.1736869444.81f8cc9f-lp156.7515.1
2025-01-14 16:17:45|install|openQA-worker|4.6.1736869444.81f8cc9f-lp156.7515.1
2025-01-14 16:17:46|install|openQA-continuous-update|4.6.1736869444.81f8cc9f-lp156.7515.1
2025-01-14 16:17:48|install|openQA-auto-update|4.6.1736869444.81f8cc9f-lp156.7515.1
2025-01-14 16:17:48|install|perl-Perl-Tidy|20250105.0.0-lp156.3.1|
2025-01-14 16:17:48|install|os-autoinst-devel|4.6.1736869520.3d40ba7-lp156.2001.1
2025-01-14 17:53:42|install|openQA-common|4.6.1736869444.81f8cc9f-lp156.7517.1
2025-01-14 17:53:44|install|openQA-continuous-update|4.6.1736869444.81f8cc9f-lp156.7517.1
2025-01-14 17:53:44|install|openQA-client|4.6.1736869444.81f8cc9f-lp156.7517.1
2025-01-14 17:53:45|install|openQA-auto-update|4.6.1736869444.81f8cc9f-lp156.7517.1
2025-01-14 17:53:49|install|openQA-worker|4.6.1736869444.81f8cc9f-lp156.7517.1
2025-01-14 19:17:36|install|os-autoinst-distri-opensuse-deps|1.1736881600.3e38d3ba-lp156.16137.1|
2025-01-15 04:24:36|install|os-autoinst-distri-opensuse-deps|1.1736914421.21888f1c-lp156.16138.1|
2025-01-15 04:36:36|install|os-autoinst-distri-opensuse-deps|1.1736915087.9b730b21-lp156.16139.1|
2025-01-15 07:12:36|install|os-autoinst-distri-opensuse-deps|1.1736924291.2cfd3894-lp156.16140.1|
2025-01-15 07:54:36|install|os-autoinst-distri-opensuse-deps|1.1736926855.94bc815c-lp156.16141.1|
2025-01-15 09:06:54|install|perl-Mojo-IOLoop-ReadWriteProcess|0.340.0-bp156.1.1|
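To narrow that list down, one could also diff the package sets of a known-good and a broken snapshot directly; a sketch using snapshot numbers from the listing above:
# rpm can query a foreign root offline, so no need to boot into the snapshots
diff <(rpm --root /.snapshots/6503/snapshot -qa | sort) \
     <(rpm --root /.snapshots/6659/snapshot -qa | sort)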
Updated by okurz 16 days ago · Edited
Now trying to reproduce again on a different o3 worker that is still in the broken state, and then trying to partially revert package versions:
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/4774948 WORKER_CLASS=openqaworker25 _GROUP=0 BUILD= TEST+=-okurz-w25
1 job has been created:
- openqa-Tumbleweed-dev-x86_64-Build:TW.34154-openqa_install+publish@64bit-2G -> https://openqa.opensuse.org/tests/4774997
Incompleted the same way: https://openqa.opensuse.org/tests/4774997. Now reverting openQA packages.
Can't revert: find /var/cache/zypp/ | less
reveals only one cached package, "/var/cache/zypp/packages/devel_openQA/x86_64/openQA-common-4.6.1715785524.aacf9bc3-lp155.6658.1.x86_64.rpm", so there might be a regression from https://github.com/os-autoinst/openQA/pull/6104; reopening #174313
Updated by favogt 16 days ago
Diffing between working and broken snapshots yields this:
--- /.snapshots/6663/snapshot/usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm 2025-01-14 11:14:21.000000000 +0000
+++ /.snapshots/6662/snapshot/usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm 2023-10-17 08:24:30.000000000 +0000
@@ -403,8 +403,7 @@
sub is_running {
my ($self) = shift;
$self->session->consume_collected_info;
- return 0 unless my $pid = $self->process_id;
- kill(0, ($self->kill_whole_group ? -$pid : $pid));
+ $self->process_id ? kill 0 => $self->process_id : 0;
}
sub write_pidfile {
@@ -520,6 +519,7 @@
my $pid = $self->pid;
return $self unless defined $pid;
return $self->_shutdown(1) unless $self->is_running;
+ $self->_diag("Stopping $pid") if DEBUG;
my $ret;
my $attempt = 1;
@@ -527,9 +527,7 @@
my $sleep_time = $self->sleeptime_during_kill;
my $max_attempts = $self->max_kill_attempts;
my $signal = $self->_default_kill_signal;
- $pid = -$pid if $self->kill_whole_group;
- $self->_diag("Stopping $pid") if DEBUG;
-
+ $pid = -getpgrp($pid) if $self->kill_whole_group;
until ((defined $ret && ($ret == $pid || $ret == -1))
|| ($attempt > $max_attempts && $timeout <= 0))
{
i.e.
openqaworker26:~ # rpm --root /.snapshots/6663/snapshot/ -q perl-Mojo-IOLoop-ReadWriteProcess
perl-Mojo-IOLoop-ReadWriteProcess-0.340.0-lp156.45.1.noarch
openqaworker26:~ # rpm --root /.snapshots/6664/snapshot/ -q perl-Mojo-IOLoop-ReadWriteProcess
perl-Mojo-IOLoop-ReadWriteProcess-0.340.0-bp156.1.1.noarch
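That is_running rewrite is a plausible culprit: kill with signal 0 only probes existence, and a negative PID probes a process group, which only exists if $pid is itself a group leader; the working version instead probed the plain PID in is_running and resolved the real group via getpgrp($pid) in stop. A minimal standalone sketch (not from the ticket) of the difference:
#!/bin/bash
# The backgrounded child below never becomes a process-group leader (no
# setsid), mirroring a freshly registered process without its own group.
sleep 60 &
pid=$!
kill -0 "$pid" && echo "PID $pid is alive"           # probe the PID: succeeds
kill -0 -- "-$pid" || echo "no process group $pid"   # probe the group: ESRCH
# With kill_whole_group set, the new is_running takes the failing group probe
# and reports the live process as not running.
kill "$pid"; wait "$pid" 2>/dev/null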
Updated by okurz 16 days ago
- Copied to action #175482: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" - why did no tests prevent this to be deployed on both o3+osd? added
Updated by okurz 16 days ago
- Description updated (diff)
Thanks @favogt
We verified that downgrading perl-Mojo-IOLoop-ReadWriteProcess works in combination with restarting the openQA workers, so running:
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "zypper -n in --oldpackage perl-Mojo-IOLoop-ReadWriteProcess-0.340.0-bp156.1.1 && systemctl mask openQA-auto-update && reboot" ; done
Updated by livdywan 16 days ago
- Related to action #175494: [openQA][worker][ipmi] isotovideo can not be started added
Updated by livdywan 16 days ago
- Related to deleted (action #175494: [openQA][worker][ipmi] isotovideo can not be started)
Updated by livdywan 16 days ago
- Has duplicate action #175494: [openQA][worker][ipmi] isotovideo can not be started added
Updated by nicksinger 16 days ago
I did:
salt -C 'G@roles:worker' cmd.run 'zypper in -f --oldpackage -r repo-oss perl-Mojo-IOLoop-ReadWriteProcess && bash -c "reboot &"'
on OSD to cover a downgrade of most OSD workers. Unfortunately not all workers have a repo named repo-oss, so @okurz did another pass with the explicit version call from https://progress.opensuse.org/issues/175464#note-8.
On some hosts (e.g. sapworker1) I also used this command:
zypper lr -Eu | grep -E "leap/15\.[0-9]+/repo/oss/*$" | cut -d "|" -f 1 | awk "{\$1=\$1;print}" | xargs -I{} zypper in -y -n -f --oldpackage -r {} perl-Mojo-IOLoop-ReadWriteProcess
This would work on every worker, but I never managed to execute it via salt (too much escaping needed).
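A possible way around the quoting is Salt's cmd.script, which copies a script from the master and executes it locally on each minion, so no escaping has to survive the command line (a sketch; the salt:// path and file name are made up):
# downgrade-rwp.sh, placed in the Salt master's file roots:
#!/bin/bash
set -euo pipefail
# pick the number of the first enabled Leap OSS repo, whatever its alias
repo=$(zypper lr -Eu | awk -F'|' '/leap\/15\.[0-9]+\/repo\/oss\/*$/ {gsub(/ /, "", $1); print $1; exit}')
zypper -n in -f --oldpackage -r "$repo" perl-Mojo-IOLoop-ReadWriteProcess
# then, from the master:
salt -C 'G@roles:worker' cmd.script salt://downgrade-rwp.sh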
Updated by ybonatakis 16 days ago · Edited
okurz wrote in #note-16:
Thanks @favogt
we verified that downgrading perl-Mojo-IOLoop-ReadWriteProcess works together with restarting openQA workers so doing
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "zypper -n in --oldpackage perl-Mojo-IOLoop-ReadWriteProcess-0.340.0-bp156.1.1 && systemctl mask openQA-auto-update && reboot" ; done
After the downgrade of perl-Mojo-IOLoop-ReadWriteProcess I restarted all the o3 jobs I found via dry_run=1 host=ybonatakis@ariel.dmz-prg2.suse.org ./openqa-advanced-retrigger-jobs
- I checked random jobs for "isotovideo can not be started": not reproduced. Other incomplete jobs are still visible, but those are due to missing assets.
- I continued going through the job groups filtering for incompletes: found none, as most jobs were in running/scheduled state.
Updated by okurz 16 days ago
- Related to action #170209: [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M added
Updated by gpathak 16 days ago
- Related to action #175473: OpenQA Jobs test - Incomplete jobs (not restarted) of last 24h alert Salt added
Updated by pcervinka 16 days ago
A job failed just a few minutes ago on qesapworker-prg5:23: https://openqa.suse.de/tests/16458701/logfile?filename=worker-log.txt
[2025-01-15T12:02:33.979239Z] [info] [pid:4210] Preparing cgroup to start isotovideo
[2025-01-15T12:02:33.980406Z] [warn] [pid:4210] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 85.
[2025-01-15T12:02:33.980559Z] [info] [pid:4210] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
[2025-01-15T12:02:33.980914Z] [info] [pid:4210] Starting isotovideo container
[2025-01-15T12:02:33.985090Z] [debug] [pid:4210] Registered process:94385
[2025-01-15T12:02:33.986015Z] [error] [pid:4210] Unable to setup job 16458701: isotovideo can not be started
Updated by okurz 16 days ago
- Copied to action #175518: Conduct "lessons learned" with Five Why analysis for "jobs incomplete with setup failure: isotovideo can not be started" size:S added
Updated by okurz 16 days ago
There are still recent incompletes with the same error as reported, e.g. https://openqa.suse.de/tests/16459246 from w40. I confirmed that w40 has the correctly downgraded package version, but maybe the worker processes have still not been restarted. Trying sudo systemctl restart openqa-worker.target
Called
sudo salt -C 'G@roles:worker' cmd.run 'sudo systemctl restart openqa-worker.target'
and once again
host=openqa.suse.de ./openqa-advanced-retrigger-jobs
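To verify the restart actually took effect everywhere, one possible spot check (an assumption, not a command from the ticket) is to compare worker process start times against the downgrade time, since a running Perl process keeps the already-loaded faulty module in memory:
salt -C 'G@roles:worker' cmd.run \
    'for p in $(pgrep -f openqa/script/worker); do ps -o pid=,lstart= -p "$p"; done'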
Updated by okurz 16 days ago
- Description updated (diff)
- Due date set to 2025-01-29
- Status changed from In Progress to Feedback
Called on ariel
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "systemctl unmask openQA-auto-update" ; done
I think the unit would have been called "openqa-auto-update" anyway.
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "rpm -q perl-Mojo-IOLoop-ReadWriteProcess" ; done
also looks ok.
I will check the state again in some hours and/or days.
Updated by pcervinka 16 days ago
okurz wrote in #note-33:
Called on ariel
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "systemctl unmask openQA-auto-update" ; done
What about qesapworker-prg5?
Updated by okurz 16 days ago
pcervinka wrote in #note-34:
okurz wrote in #note-33:
Called on ariel
hosts="openqaworker21 openqaworker22 openqaworker23 openqaworker24 openqaworker25 openqaworker26 openqaworker27 openqaworker28 openqaworker-arm21 openqaworker-arm22 qa-power8-3"; for i in $hosts; do echo "### $i" && ssh root@$i "systemctl unmask openQA-auto-update" ; done
what about qesapworker-prg5?
That's part of OSD. I found that the openQA worker processes had not all been consistently restarted, which explains your observation from #175464-29; fixed meanwhile. A cloned job from the scenario you referenced has meanwhile passed on another worker: https://openqa.suse.de/tests/16458902
qesapworker-prg5 is also ok by now, e.g. see https://openqa.suse.de/tests/16460304
Updated by okurz 15 days ago
- Due date deleted (2025-01-29)
- Status changed from Feedback to Resolved
All still good. Follow-up tickets created. Also
select id,test,clone_id,t_started,t_finished from jobs where result='incomplete' and reason like '%isotovideo can not be started%' and t_finished >= '2025-01-15 11:00' and clone_id is null;
did not return any more matches.