action #110196
closedA big number of tests fail with networking (all workers) due to SLE libslirp0 update
Description
Observation
This is just a sample test - many more are affected across the board.
The SUTs seem not to receive DHCP leases, resulting in all kinds of test issues (a quick check for the missing lease is sketched below).
Rerunning them might or might not work.
openQA test in scenario opensuse-Tumbleweed-XFCE-Live-x86_64-xfce-live@64bit fails in
prepare_test_data
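A quick way to confirm the missing lease from inside an affected SUT (not part of the tests themselves, just an illustrative check assuming the default qemu user-mode networking, where slirp normally hands out 10.0.2.15):
ip -4 addr show                       # no inet address on the NIC means no DHCP lease
journalctl -u wickedd-dhcp4 | tail    # DHCP client logs, assuming wicked is the network manager of the product under test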
Reproducible
Fails since (at least) Build 20220421
Expected result
Last good: 20220420 (or more recent)
Suggestions
- Look into mitigations for both o3 and osd
- Ensure root cause issue is fixed
- Conduct lessons learned
Rollback steps
- Wait until libslirp0 is fixed
- Remove package lock on all o3+osd machines
- Install updated libslirp0 (see the command sketch after this list)
- Confirm tests are still working fine
- Revert https://github.com/os-autoinst/os-autoinst/pull/2042
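A minimal command sketch for these rollback steps on a single worker, assuming the fixed libslirp0 is published in the regular update repository (run per host, or via ssh/salt as in the comments below):
zypper rl libslirp0      # remove the package lock
zypper -n in libslirp0   # install the updated libslirp0
# then verify networking with a cloned test job before doing this fleet-wide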
Further details
Always latest result in this scenario: latest
Updated by dimstar over 2 years ago
All failed jobs seem to have run on openqaworker7
Updated by dimstar over 2 years ago
- Subject changed from A big number of tests fail with networking to A big number of tests fail with networking (OW7)
Updated by dimstar over 2 years ago
- Subject changed from A big number of tests fail with networking (OW7) to A big number of tests fail with networking (all workers)
- Priority changed from Urgent to Immediate
This has since spread to the other workers, making it close to impossible for any test to pass.
Updated by okurz over 2 years ago
- Status changed from New to In Progress
- Assignee set to okurz
- Target version set to Ready
Updated by okurz over 2 years ago
So far I don't know what the cause is. I consider an os-autoinst update unlikely, as jobs after the latest hotfix were fine and there was no change after that, right? Maybe a qemu update, or some failing service or configuration?
DimStar
I have seen a dnsmasq update on ariel
okurz[m]
that sounds suspicious. Let me check the dnsmasq logs
DimStar
I tried reverting that yesterday on ariel, but it did not make ow7 better (ow1 and 4 still worked yesterday evening)
okurz[m]
oh, ok
DimStar
Overnight it got worse and now all of them fail
okurz[m]
I rolled back openqaworker7 to the state of 2022-04-21 now, triggering tests on that machine.
DimStar
It's possible though that the update was applied again
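For reference, checking the dnsmasq update mentioned in the chat above and its logs on ariel could look roughly like this (a sketch, not necessarily the exact commands that were used):
grep dnsmasq /var/log/zypp/history            # when the package was installed or updated
journalctl -u dnsmasq --since "2022-04-21"    # any DHCP/DNS errors since the suspicious update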
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310741 _GROUP=0 TEST=autotest_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=7
Created job #2310798: opensuse-Tumbleweed-DVD-x86_64-autoyast_multi_btrfs:investigate:last_good_tests_and_build:e8e5f0b966a518c15c11a4fc0b03489d3dafec9b+20220420@64bit -> https://openqa.opensuse.org/t2310798
Also reverted dnsmasq on o3
2021-12-06 20:07:30|install|dnsmasq|2.86-7.17.1|x86_64||repo-sle-update|561e2500f84e107c73091df9d0ac94bc8188bb75e32f290a822eacfe0fd0eeed|
2022-04-22 19:10:31|install|dnsmasq|2.86-150100.7.20.1|x86_64||repo-sle-update|5f5da91359421f64fe90696907f125cfcb8780824eadd2eac49c221bbbd780be|
2022-04-22 20:15:13|command|root@ariel|'zypper' 'in' '--oldpackage' 'dnsmasq-2.86-7.17.1'|
2022-04-22 20:15:15|install|dnsmasq|2.86-7.17.1|x86_64|root@ariel|repo-sle-update|561e2500f84e107c73091df9d0ac94bc8188bb75e32f290a822eacfe0fd0eeed|
2022-04-23 00:00:19|install|dnsmasq|2.86-150100.7.20.1|x86_64||repo-sle-update|5f5da91359421f64fe90696907f125cfcb8780824eadd2eac49c221bbbd780be|
and triggered new job
https://openqa.opensuse.org/tests/2310799
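The history above shows the downgraded dnsmasq being reinstalled again shortly after midnight, presumably by automatic patching; a package lock keeps a manual downgrade in place (a sketch, same pattern as applied to libslirp0 further down):
zypper -n in --oldpackage dnsmasq-2.86-7.17.1   # downgrade to the last known-good version
zypper al dnsmasq                               # lock it so automatic updates do not pull it back in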
Disabled openqa-continuous-update.timer on openqaworker7
with
systemctl disable --now openqa-continuous-update.timer
and installed the old os-autoinst version with
zypper in --force /var/cache/zypp/packages/devel_openQA/x86_64/os-autoinst-4.6.1650537502.22e982ce-lp153.1209.1.x86_64.rpm
Triggered another job
https://openqa.opensuse.org/tests/2310800
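To confirm that the downgrade sticks and the timer stays off, a check on openqaworker7 could look like this (illustrative only):
rpm -q os-autoinst                                            # should report 4.6.1650537502.22e982ce
systemctl list-timers --all openqa-continuous-update.timer    # should show no upcoming activation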
Cloning a different git version of the test distribution takes a bit long, so using a non-git job as template.
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310665 _GROUP=0 TEST=autotest_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7
Created job #2310801: opensuse-Tumbleweed-DVD-x86_64-Build20220421-autoyast_multi_btrfs@64bit -> https://openqa.opensuse.org/t2310801
I think the rollback on openqaworker7 was not effective; I found some package updates installed again. From /var/log/zypp/history I found that libslirp0 was also updated, and reverted it with
zypper -n in --oldpackage libslirp0-4.3.1-1.51
and triggered another test
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310759 _GROUP=0 TEST=textmode_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7 SCHEDULE=tests/installation/bootloader,tests/installation/welcome,tests/installation/online_repos,tests/installation/installation_mode
Created job #2310803: opensuse-Staging:I-Staging-DVD-x86_64-BuildI.420.1-textmode@64bit -> https://openqa.opensuse.org/t2310803
That test passed.
So libslirp0 is at fault.
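The libslirp0 suspect was spotted by going through the worker's package history; listing everything installed since the last good build could look like this (a sketch):
awk -F'|' '$1 >= "2022-04-21" && $2 == "install" {print $1, $3, $4}' /var/log/zypp/history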
The problematic change from the changelog:
* Wed Feb 23 2022 pgajdos@suse.com
- security update
- added patches
fix CVE-2021-3592 [bsc#1187364], invalid pointer initialization may lead to information disclosure (bootp)
+ libslirp-CVE-2021-3592.patch
fix CVE-2021-3594 [bsc#1187367], invalid pointer initialization may lead to information disclosure (udp)
+ libslirp-CVE-2021-3594.patch
fix CVE-2021-3595 [bsc#1187366], invalid pointer initialization may lead to information disclosure (tftp)
+ libslirp-CVE-2021-3595.patch
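For reference, this changelog entry can be read directly from the installed package (assuming the updated libslirp0 is still installed):
rpm -q --changelog libslirp0 | head -n 25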
Updated by okurz over 2 years ago
Pinning all our machines to that version:
for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "transactional-update run /bin/sh -c 'zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0' && reboot || zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0" ; done
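To double-check that the pin took effect on a host, the installed version and the lock can be verified like this (a sketch):
rpm -q libslirp0    # expect libslirp0-4.3.1-1.51
zypper ll           # the package lock list should contain libslirp0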
There is already a bug report https://bugzilla.opensuse.org/show_bug.cgi?id=1198773 which seems to fit.
Updated by okurz over 2 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Subject changed from A big number of tests fail with networking (all workers) to A big number of tests fail with networking (all workers) due to SLE libslirp0 update
- Description updated (diff)
- Category deleted (Bugs in existing tests)
It should be ok for o3 now. I pinned the old package version but have not restarted all affected tests.
After applying the package lock on w7 I ran systemctl enable --now openqa-continuous-update.timer
again and checked that the service runs fine with systemctl start openqa-continuous-update && journalctl -f -u openqa-continuous-update.
Also I saw that OSD has the same update installed everywhere as well, but I have not seen related problems in jobs.
Yeah, ok, broken in the same way, see https://openqa.suse.de/tests/8615929#step/scc_registration/5
Rolled back on OSD workers with
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0'
This did not work on openqaworker14+15 (see #104970), so I manually ran sudo zypper -n in --oldpackage libslirp0-4.3.1-1.51 && sudo zypper al libslirp0
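A possible follow-up check across the OSD workers, using the same salt targeting as above (illustrative only):
sudo salt --no-color -C 'G@roles:worker' cmd.run 'rpm -q libslirp0 && zypper ll'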
Updated by okurz over 2 years ago
I didn't care to be more specific about finding the failed jobs to restart, so let's take out the big hammer:
for host in openqa.suse.de openqa.opensuse.org; do result="result='failed'" host=$host openqa-advanced-retrigger-jobs; done
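The same can be limited to a single instance if needed, e.g. only o3, by calling the script once with the variables used in the loop above:
result="result='failed'" host=openqa.opensuse.org openqa-advanced-retrigger-jobs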
Updated by okurz over 2 years ago
- Category set to Regressions/Crashes
https://build.suse.de/request/show/266342 shows that zluo approved the update with the test report https://qam.suse.de/testreports/SUSE:Maintenance:23007:266342/log, which explicitly states that no tests were done by the reviewer. That is obviously bad and could have been done better, given that the purpose of libslirp0 is explicitly networking for qemu and exactly that was not tested.
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Description updated (diff)
- Due date deleted (2022-05-08)
- Status changed from In Progress to Blocked
- Priority changed from Immediate to Normal
https://bugzilla.suse.com/show_bug.cgi?id=1198773 is tracking the product bug.
Updated by okurz over 2 years ago
- Status changed from Blocked to In Progress
https://bugzilla.suse.com/show_bug.cgi?id=1198773 was resolved; I will unlock the packages, update and check again.
Updated by okurz over 2 years ago
- Tags set to reactive work
On openqaworker7 I did
zypper rl libslirp0 && zypper -n in libslirp0
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2327458 _GROUP=0 TEST=textmode_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7 SCHEDULE=tests/installation/bootloader,tests/installation/welcome,tests/installation/online_repos,tests/installation/installation_mode
Updated by okurz over 2 years ago
https://openqa.opensuse.org/tests/2328321 is fine, so doing
for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "zypper rl libslirp0 && transactional-update run /bin/sh -c 'zypper -n in libslirp0' && reboot || zypper -n in libslirp0" ; done
and for OSD
sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'zypper rl libslirp0 && zypper -n in libslirp0'
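To confirm the final state (updated package, no leftover locks) on the o3 workers, a check mirroring the loop above could be (a sketch):
for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "rpm -q libslirp0; zypper ll"; done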
Updated by openqa_review over 2 years ago
- Due date set to 2022-05-19
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Due date deleted (2022-05-19)
- Status changed from In Progress to Resolved
All o3 and OSD workers are up to date, with no zypper locks in place for libslirp0.