action #110196

closed

A big number of tests fail with networking (all workers) due to SLE libslirp0 update

Added by dimstar almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Regressions/Crashes
Target version: Ready
Start date: 2022-04-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

This is just a sample test - many more are affected across the board

The SUTs seem not to receive DHCP leases, resulting in all kinds of test issues.

Rerunning them might or might not work

openQA test in scenario opensuse-Tumbleweed-XFCE-Live-x86_64-xfce-live@64bit fails in prepare_test_data
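
The failing scenarios depend on qemu's user-mode networking, where DHCP/bootp is served by libslirp itself. A quick way to confirm which network backend a job used is to check the qemu command line that os-autoinst records in the job log; a sketch, assuming the default pool path and pool number 1:

# jobs started with "-netdev user" get DHCP/bootp from libslirp
grep -o -- '-netdev user[^ ]*' /var/lib/openqa/pool/1/autoinst-log.txt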

Reproducible

Fails since (at least) Build 20220421

Expected result

Last good: 20220420 (or more recent)

Suggestions

  • Look into mitigations for both o3 and osd
  • Ensure root cause issue is fixed
  • Conduct lessons learned

Rollback steps

Further details

Always latest result in this scenario: latest

Actions #1

Updated by dimstar almost 2 years ago

All failed jobs seem to have run on openqaworker7

Actions #2

Updated by dimstar almost 2 years ago

  • Subject changed from A big number of tests fail with networking to A big number of tests fail with networking (OW7)
Actions #3

Updated by dimstar almost 2 years ago

  • Subject changed from A big number of tests fail with networking (OW7) to A big number of tests fail with networking (all workers)
  • Priority changed from Urgent to Immediate

This has since spread to the other workers, making it close to impossible for any test to pass

Actions #4

Updated by okurz almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready
Actions #5

Updated by okurz almost 2 years ago

so far I don't know what the cause is. I consider an os-autoinst update unlikely as jobs after the latest hotfix were fine and there was no change after that, right? Maybe a qemu update or some failing service or configuration?

DimStar: I have seen a dnsmasq update on ariel
okurz[m]: that sounds suspicious. Let me check the dnsmasq logs
DimStar: I tried reverting that yesterday on ariel, but it did not make ow7 better (ow1 and 4 still worked yesterday evening)
okurz[m]: oh, ok
DimStar: overnight it got worse and all fail
okurz[m]: I rolled back openqaworker7 to the state of 2022-04-21 now, triggering tests on that machine.
DimStar: possible though that the update was applied again

openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310741 _GROUP=0 TEST=autotest_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=7

Created job #2310798: opensuse-Tumbleweed-DVD-x86_64-autoyast_multi_btrfs:investigate:last_good_tests_and_build:e8e5f0b966a518c15c11a4fc0b03489d3dafec9b+20220420@64bit -> https://openqa.opensuse.org/t2310798

Also reverted dnsmasq on o3

2021-12-06 20:07:30|install|dnsmasq|2.86-7.17.1|x86_64||repo-sle-update|561e2500f84e107c73091df9d0ac94bc8188bb75e32f290a822eacfe0fd0eeed|
2022-04-22 19:10:31|install|dnsmasq|2.86-150100.7.20.1|x86_64||repo-sle-update|5f5da91359421f64fe90696907f125cfcb8780824eadd2eac49c221bbbd780be|
2022-04-22 20:15:13|command|root@ariel|'zypper' 'in' '--oldpackage' 'dnsmasq-2.86-7.17.1'|
2022-04-22 20:15:15|install|dnsmasq|2.86-7.17.1|x86_64|root@ariel|repo-sle-update|561e2500f84e107c73091df9d0ac94bc8188bb75e32f290a822eacfe0fd0eeed|
2022-04-23 00:00:19|install|dnsmasq|2.86-150100.7.20.1|x86_64||repo-sle-update|5f5da91359421f64fe90696907f125cfcb8780824eadd2eac49c221bbbd780be|

and triggered a new job
https://openqa.opensuse.org/tests/2310799
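
The dnsmasq lines above come straight from the standard zypper log; a generic way to pull such entries for any package:

# /var/log/zypp/history records installs as pipe-separated lines
grep '|dnsmasq|' /var/log/zypp/history | tail -n 5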

Disabled openqa-continuous-update.timer on openqaworker7 with

systemctl disable --now openqa-continuous-update.timer

and installed the old version with

zypper in --force /var/cache/zypp/packages/devel_openQA/x86_64/os-autoinst-4.6.1650537502.22e982ce-lp153.1209.1.x86_64.rpm

Triggered another job
https://openqa.opensuse.org/tests/2310800
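
The exact filename of the old os-autoinst package came from the local zypper package cache; a way to list what is still available there (a sketch, assuming the cache has not been cleaned):

# packages installed from devel_openQA stay cached under /var/cache/zypp
ls -t /var/cache/zypp/packages/devel_openQA/x86_64/os-autoinst-*.rpm | head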

It takes a bit too long to clone a different git version of the test distribution, so I am using a non-git job as template instead.

openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310665 _GROUP=0 TEST=autotest_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7

Created job #2310801: opensuse-Tumbleweed-DVD-x86_64-Build20220421-autoyast_multi_btrfs@64bit -> https://openqa.opensuse.org/t2310801

I think the rollback on openqaworker7 was not effective; I found some package updates installed again. From /var/log/zypp/history I saw that libslirp0 was also updated; reverted with

zypper -n in --oldpackage libslirp0-4.3.1-1.51

and triggered another test

 openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2310759 _GROUP=0 TEST=textmode_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7 SCHEDULE=tests/installation/bootloader,tests/installation/welcome,tests/installation/online_repos,tests/installation/installation_mode

Created job #2310803: opensuse-Staging:I-Staging-DVD-x86_64-BuildI.420.1-textmode@64bit -> https://openqa.opensuse.org/t2310803

That test passed.

So libslirp0 is at fault.
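
The same conclusion could have been reached generically by listing every package installed in the regression window from the zypp history; a sketch based on the pipe-separated format shown above (fields: timestamp|action|package|version|...):

# list package installs on 2022-04-21/22, between last good and first bad build
awk -F'|' '$1 ~ /^2022-04-2[12]/ && $2 == "install" {print $1, $3, $4}' /var/log/zypp/history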

The problematic change from the changelog:

* Wed Feb 23 2022 pgajdos@suse.com
- security update
- added patches
  fix CVE-2021-3592 [bsc#1187364], invalid pointer initialization may lead to information disclosure (bootp)
  + libslirp-CVE-2021-3592.patch
  fix CVE-2021-3594 [bsc#1187367], invalid pointer initialization may lead to information disclosure (udp)
  + libslirp-CVE-2021-3594.patch
  fix CVE-2021-3595 [bsc#1187366], invalid pointer initialization may lead to information disclosure (tftp)
  + libslirp-CVE-2021-3595.patch
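
To check whether the installed libslirp0 actually contains these patches, the package changelog can be queried directly:

# shows the most recent changelog entries of the installed package
rpm -q --changelog libslirp0 | head -n 15
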
Actions #6

Updated by okurz almost 2 years ago

Pinning all our machines to that version:

for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "transactional-update run /bin/sh -c 'zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0' && reboot || zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0" ; done
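
The one-liner leans on shell fallback logic; the same loop spelled out with comments (identical host list and version):

# Pin libslirp0 to the last good version on every o3 worker.
for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do
  echo "$i"
  # Transactional hosts: install and lock inside a new snapshot, then reboot into it.
  # On the other hosts transactional-update does not exist, so the || branch falls
  # back to plain zypper on the running system.
  ssh "root@$i" "transactional-update run /bin/sh -c 'zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0' && reboot || zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0"
done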

There is already a bug report https://bugzilla.opensuse.org/show_bug.cgi?id=1198773 which seems to fit.

Actions #7

Updated by okurz almost 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from A big number of tests fail with networking (all workers) to A big number of tests fail with networking (all workers) due to SLE libslirp0 update
  • Description updated (diff)
  • Category deleted (Bugs in existing tests)

It should be ok for o3 now. I pinned the old package version but I have not restarted all affected tests.

After applying the package lock I enabled openqa-continuous-update.timer on w7 again with systemctl enable --now openqa-continuous-update.timer and checked that the service runs fine with systemctl start openqa-continuous-update && journalctl -fu openqa-continuous-update.

Also I saw that OSD has the same update installed everywhere but had not seen related problems in jobs at first.
Update: yeah, ok, OSD is broken the same way, see https://openqa.suse.de/tests/8615929#step/scc_registration/5

Rolled back on OSD workers with

sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'zypper -n in --oldpackage libslirp0-4.3.1-1.51 && zypper al libslirp0'

This did not work on openqaworker14+15 (see #104970), so I manually ran sudo zypper -n in --oldpackage libslirp0-4.3.1-1.51 && sudo zypper al libslirp0

Actions #8

Updated by okurz almost 2 years ago

I didn't try to be more specific about finding which failed jobs to restart, so let's take out the big hammer:

for host in openqa.suse.de openqa.opensuse.org; do result="result='failed'" host=$host openqa-advanced-retrigger-jobs; done
Actions #9

Updated by okurz almost 2 years ago

  • Category set to Regressions/Crashes

https://build.suse.de/request/show/266342 shows that zluo approved the update with the test report https://qam.suse.de/testreports/SUSE:Maintenance:23007:266342/log, which explicitly states that no tests were done by the reviewer. That is obviously bad and could have been done better: the purpose of libslirp0 is explicitly networking for qemu, and exactly that was not tested.

Actions #10

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-05-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz almost 2 years ago

  • Description updated (diff)
  • Due date deleted (2022-05-08)
  • Status changed from In Progress to Blocked
  • Priority changed from Immediate to Normal
Actions #12

Updated by okurz almost 2 years ago

  • Description updated (diff)
Actions #13

Updated by okurz almost 2 years ago

  • Status changed from Blocked to In Progress

https://bugzilla.suse.com/show_bug.cgi?id=1198773 was resolved; I will unlock the packages, update and check again.

Actions #14

Updated by okurz almost 2 years ago

  • Tags set to reactive work

On openqaworker7 I did

zypper rl libslirp0 && zypper -n in libslirp0
openqa-clone-job --within-instance https://openqa.opensuse.org/tests/2327458 _GROUP=0 TEST=textmode_investigate_poo110196 _SKIP_POST_FAIL_HOOKS=1 WORKER_CLASS=openqaworker7 SCHEDULE=tests/installation/bootloader,tests/installation/welcome,tests/installation/online_repos,tests/installation/installation_mode

-> https://openqa.opensuse.org/tests/2327458

Actions #15

Updated by okurz almost 2 years ago

https://openqa.opensuse.org/tests/2328321 is fine, so doing

for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "zypper rl libslirp0 && transactional-update run /bin/sh -c 'zypper -n in libslirp0' && reboot || zypper -n in libslirp0" ; done

and for OSD

sudo salt --no-color --state-output=changes -C 'G@roles:worker' cmd.run 'zypper rl libslirp0 && zypper -n in libslirp0'
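
Afterwards, something like this can verify that the lock is gone and the current package is installed on all OSD workers (a sketch reusing the same salt target):

# expect the updated libslirp0 version and no remaining lock entry for it
sudo salt --no-color -C 'G@roles:worker' cmd.run 'rpm -q libslirp0; zypper ll'
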
Actions #16

Updated by openqa_review almost 2 years ago

  • Due date set to 2022-05-19

Setting due date based on mean cycle time of SUSE QE Tools

Actions #17

Updated by okurz almost 2 years ago

  • Due date deleted (2022-05-19)
  • Status changed from In Progress to Resolved

All o3+OSD workers are up-to-date; no zypper locks in place for libslirp0.
