Project

General

Profile

action #66376

MM tests fail in obscure way when tap device is not present

Added by okurz 9 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-05-04
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

MM tests fail to access other machines in the network if tap device is not present, there is a diag message reported but the test does not fail at this point.

For example
https://openqa.suse.de/tests/4189385#step/setup/42 shows "Failed to connect to 10.0.2.2 port 20243: No route to host" and fails. What is hidden in autoinst-log.txt:

[2020-05-04T08:30:16.586 CEST] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap23 13' : 'tap23' is not connected to bridge 'br1'

but does not fail on this. The worker machine instance is openqaworker-arm-3:24 which should not even have been running as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L570 shows that only 20 worker instances have been configured. The service "openqa-worker@24" is "enabled-runtime", so not "enabled", maybe pulled in manually by "openqa-worker.target".

Acceptance criteria

  • AC1: multi-machine tests relying on a tap device should fail hard if the tap device is not present or can not be configured

Suggestions


Related issues

Related to openQA Infrastructure - action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failuresNew2020-02-26

Related to openQA Infrastructure - action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retryResolved2020-08-072020-09-01

History

#1 Updated by okurz 9 months ago

# salt -l error --no-color -C  'G@roles:worker and G@osarch:aarch64' cmd.run "systemctl is-active openqa-worker@\* | wc -l"
openqaworker-arm-1.suse.de:
    10
openqaworker-arm-3.suse.de:
    26
openqaworker-arm-2.suse.de:
    30

shows that openqaworker-arm-2 and openqaworker-arm-3 are currently running incorrectly.

As a workaround I have done systemctl mask --now openqa-worker@{21..30} on openqaworker-arm-3 and ssh openqaworker-arm-2 "sudo systemctl mask --now openqa-worker@{21..30}" on openqaworker-arm-2.

#2 Updated by okurz 9 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

#3 Updated by sebchlad 9 months ago

Naturally I'm looking at the problem as the "user" who expects that provided workers are ready for multimachine testing.

From this stand point I think:

I could imagine adding simple module: hpc(mm?)_network_checks.pm
The module would load on all nodes and to stuff like curl to outside etc.
Checks would be fatal and would fail the test immediately.
Wdyt?

Any better ideas?
Is above not needed?

side note: this is still just prevention measure. After all by now checking mm tests for me is easy; easy in the sense of concluding - yeah, restart and ask tools for restarting a worker ;-)
so we should perhaps focus on checking what we could improve on the infrastructure/dev-ops side?

#4 Updated by okurz 9 months ago

sebchlad wrote:

[…]

I think all checks that are done within the SUT add a layer of complexity that should not be necessary to find out if the prerequisites are fulfilled. And from the referenced failures one can see that something with the network is wrong but I do not see it as "obvious" that it is a problem on the worker side without checking cloned jobs or previous tests. At best we have checks that are not relying on the SUT VM at all, e.g. just code that is directly executed on the worker, not within the SUT VM, like pinging the remote host over the selected network interface.

side note: this is still just prevention measure. After all by now checking mm tests for me is easy; easy in the sense of concluding - yeah, restart and ask tools for restarting a worker ;-)
so we should perhaps focus on checking what we could improve on the infrastructure/dev-ops side?

sure, that is part of the story. Next to the problems regarding the systemd services as in the ticket description I am thinking about what to do in monitoring for better checks. Also I would like more of the logic from salt to be done by upstream openQA or os-autoinst, e.g. see https://github.com/okurz/openQA/blob/feature/setup_mm/script/setup_mm

#5 Updated by nicksinger 9 months ago

okurz wrote:

  • Find out why the systemd worker instances on openqaworker-arm-3 are "enabled-runtime" and prevent any workers over the configured amount of instances to be enabled.

Unfortunately I didn't see anything suspicious on that system regarding the enabled-runtime. According to https://www.freedesktop.org/software/systemd/man/systemctl.html#is-enabled%20UNIT%E2%80%A6 this is caused by somebody starting it with --runtime just until the next system reboot. Maybe somebody thought we have not enough workers? I can try to dig through some logs but I fear this information might be lost.

#6 Updated by okurz 6 months ago

Trying to trigger test jobs that I can use to check on the errors, correct handling with fatal errors, crosscheck that everything is ok for correct devices:

I tried to schedule test that go immediately to another job group to not mess up the main dashboard.
Now unfortunately the "isos" route can take a parameter "_GROUP_ID" which behaves different than the same parameter on the "jobs" route as it restricts scheduling to tests that are defined within the specified job group. So we need to go another route:

res=$(openqa-cli api --osd -X post isos ARCH=aarch64 BUILD=okurz_poo66376 DISTRI=sle VERSION=15-SP2 FLAVOR=Online TEST="hpc_DELTA_slurm_accounting_supportserver,hpc_DELTA_slurm_master_accounting,hpc_DELTA_slurm_slave00,hpc_DELTA_slurm_slave01"); echo $res; for i in $(echo $res | jq -r -c .ids | tr '[],' ' '); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

this schedules jobs and moves them to the test development job group. I wonder why this also schedules "create_hdd_hpc_textmode".

Added to openqaworker-arm-2:/etc/openqa/workers.ini:

[40]
WORKER_CLASS=tap,poo66376

[41]
WORKER_CLASS=tap,poo66376

[42]
WORKER_CLASS=tap,poo66376

[43]
WORKER_CLASS=tap,poo66376

[44]
WORKER_CLASS=tap,poo66376

[45]
WORKER_CLASS=tap,poo66376

and started sudo systemctl start openqa-worker@{40..45} and spawned

res=$(openqa_client_osd --json-output isos post ARCH=aarch64 BUILD=okurz_poo66376 DISTRI=sle VERSION=15-SP2 FLAVOR=Online TEST="hpc_DELTA_slurm_accounting_supportserver,hpc_DELTA_slurm_master_accounting,hpc_DELTA_slurm_slave00,hpc_DELTA_slurm_slave01" WORKER_CLASS=tap,poo66376); echo $res; for i in $(echo $res | jq -r -c .ids | tr '[],' ' '); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

with result { "count" : 5, "failed" : [], "ids" : [ 4461236, 4461237, 4461238, 4461239, 4461240 ], "scheduled_product_id" : 569965 }

so this will probably fail due to missing SLE variables or something but maybe it will also show warnings about missing tap devices in log files. Then I can reschedule with QEMU_FATAL_DBUS_CALL=1 and see if the feedback differs.

Hm. As I did not supply an ISO the test fails early. There is no error about dbus in the logs. I doubt this approach makes sense.

Maybe better to clone wicked jobs:

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

This looks better. https://openqa.suse.de/tests/4462217 is in the test development group now and https://openqa.suse.de/tests/4462217#step/before_test/17 shows that we fail to get access to the other machine as expected. https://openqa.suse.de/tests/4462216/file/autoinst-log.txt shows

[2020-07-17T15:21:17.183 UTC] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap42 16' : 'tap42' is not connected to bridge 'br1'
[2020-07-17T15:21:17.224 UTC] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap106 16' : 'tap106' is not connected to bridge 'br1'

now with fatal dbus calls:

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

-> https://openqa.suse.de/t4462395

Funny enough, this fails the same way but nothing in logs.

I am just changing code on openqaworker-arm-2 now and testing. Three test cases:

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

-> https://openqa.suse.de/t4462405 -> good, fatal fail with good message

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

-> https://openqa.suse.de/t4462407 -> good, passed early steps

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

-> https://openqa.suse.de/t4462409 -> good, error recorded in log but continues

Now turning this around with QEMU_NON_FATAL_DBUS_CALL, changing source on openqaworker-arm-2 and testing again:

for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 WORKER_CLASS=tap,poo66376 QEMU_NON_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 QEMU_NON_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done

-> https://openqa.suse.de/tests/4462602 -> good, failed with non-fatal dbus handling
-> https://openqa.suse.de/tests/4462604 -> good, passed with non-fatal dbus handling
-> https://openqa.suse.de/tests/4462605 -> good, incomplete with good error report with fatal dbus handling
-> https://openqa.suse.de/tests/4462608 -> good, passed with fatal dbus handling

https://github.com/os-autoinst/os-autoinst/pull/1477

Stopped additional worker instances on openqaworker-arm-2 again and removed config in /etc/openqa/workers.ini

Now with for i in {1..50}; do echo -e "$i:\t$(sudo systemctl is-enabled openqa-worker@$i)"; done I can see:

1:      enabled
…
20:     enabled
21:     masked
…
30:     masked
31:     disabled
…
39:     disabled
40:     enabled-runtime
…
45:     enabled-runtime
46:     disabled
…
50:     disabled

after initially I did sudo systemctl start openqa-worker@{40..45} but it shows "enabled-runtime". I can not find any symlinks for these worker instances in neither /etc/systemd/system/ nor /run/systemd/system/ and sudo systemctl disable openqa-worker@{40..50} as well as sudo systemctl disable --runtime openqa-worker@{40..50} have no effect. I do not understand this enabled-runtime. I can do sudo systemctl mask --runtime openqa-worker@{40..45} to mask them but only in runtime but I still don't understand it. Also "status" reports something different seemingly:

sudo systemctl status openqa-worker@{39..40}
● openqa-worker@39.service - openQA Worker #39
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: inactive (dead)

● openqa-worker@40.service
   Loaded: masked (/dev/null; enabled-runtime)
   Active: inactive (dead) since Fri 2020-07-17 19:17:42 UTC; 13min ago
 Main PID: 43089 (code=exited, status=0/SUCCESS)

so one instance is "loaded (disabled)", the other is "masked (enabled-runtime)"?!? Even sudo systemctl revert openqa-worker@{40..45} doesn't do a difference here. I researched for all web references I could find, looked through the man pages and github.com/systemd/systemd but the only thing remotely related to this "enabled-runtime" thing I could find is https://github.com/systemd/systemd/issues/9122 which mentions a pull request https://github.com/systemd/systemd/pull/9147 which references systemd tag v239 . In Leap 15.1 and 15.2 we have v234 but in Tumbleweed we would have 245 which might behave different in this context. Maybe we can include a call to mask --runtime in salt to prevent any superfluous worker instances to be started.

EDIT: 2020-07-21: Crosschecking behaviour on openqaworker12 and also openqaworker-arm-1 shows again differences. On openqaworker-arm-1 I did:

systemctl start openqa-worker@40; systemctl stop openqa-worker@40

but systemctl is-enabled openqa-worker@40 yields disabled and systemctl status openqa-worker@40 shows

● openqa-worker@40.service - openQA Worker #40
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: inactive (dead)

whereas on openqaworker-arm-2 we have

● openqa-worker@40.service - openQA Worker #40
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: inactive (dead) since Fri 2020-07-17 19:17:42 UTC; 3 days ago
 Main PID: 43089 (code=exited, status=0/SUCCESS)

so it differs in "disabled" vs. "enabled-runtime" and also in the case of openqaworker-arm-2 we have a Main PID: 43089 (code=exited, status=0/SUCCESS) even though in both cases we see in the logs clearly a process that cleanly exited and a PID from logs is also available. Why does it show "Main PID" and "enabled-runtime" in one case and not in the other? And now suddenly openqaworker-arm-2 just died, see https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 , Did I do this?

#7 Updated by okurz 6 months ago

Testing on openqaworker12 with a local checkout of salt repo(s).

The old code gives for salt-call -l debug --local --state-output=changes state.apply openqa.worker

----------
          ID: stop_and_disable_all_workers
    Function: cmd.run
        Name: systemctl stop openqa-worker@{1..100}; systemctl disable openqa-worker@{1..100}
      Result: True
     Comment: Command "systemctl stop openqa-worker@{1..100}; systemctl disable openqa-worker@{1..100}" run
     Started: 13:32:08.826127
    Duration: 844.103 ms
     Changes:   
              ----------
              pid:
                  17956
              retcode:
                  0
              stderr:
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@2.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@3.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@4.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@5.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@6.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@7.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@8.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@9.service.
                  Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@10.service.
              stdout:

We should keep in mind that this is only triggered when /etc/openqa/workers.ini is changed. We do not want to restart all workers without any changes.

Seems like systemctl disable --now openqa-worker@\* does not stop "enabled-runtime" workers:

# date; systemctl disable --now openqa-worker@\*; systemctl --no-pager status openqa-worker@1
Wed Jul 29 14:02:59 CEST 2020
● openqa-worker@1.service - openQA Worker #1
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: active (running) since Wed 2020-07-29 13:46:12 CEST; 16min ago
 Main PID: 20516 (worker)
    Tasks: 1
   CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@1.service
           └─20516 /usr/bin/perl /usr/share/openqa/script/worker --instance 1

Jul 29 13:46:13 openqaworker12 worker[20516]:  - pool directory:        /var/lib/openqa/pool/1
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] CACHE: caching is enabled, setting up /…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Project dir for host openqa-staging-1.q…/share
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] CACHE: caching is enabled, setting up /…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Project dir for host openqa-staging-2.q…/share
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Establishing ws connection via ws://ope…/ws/11
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [error] [pid:20516] Failed to register at openqa-staging-2…server
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registered and connected via websockets… ID 11
Hint: Some lines were ellipsized, use -l to show in full.

so what looks safer:

# date; systemctl disable --now openqa-worker@\*; systemctl stop openqa-worker@\*; systemctl --no-pager status openqa-worker@1
Wed Jul 29 14:03:53 CEST 2020
● openqa-worker@1.service - openQA Worker #1
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: inactive (dead) since Wed 2020-07-29 14:03:53 CEST; 26ms ago
  Process: 20516 ExecStart=/usr/share/openqa/script/worker --instance 1 (code=exited, status=0/SUCCESS)
 Main PID: 20516 (code=exited, status=0/SUCCESS)

Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Establishing ws connection via ws://ope…/ws/11
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [error] [pid:20516] Failed to register at openqa-staging-2…server
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registered and connected via websockets… ID 11
Jul 29 14:03:53 openqaworker12 systemd[1]: Stopping openQA Worker #1...
Jul 29 14:03:53 openqaworker12 worker[20516]: [info] [pid:20516] Quit due to signal TERM
Jul 29 14:03:53 openqaworker12 worker[20516]: [debug] [pid:20516] Informing openqa-staging-2.qa.suse.de …ffline
Jul 29 14:03:53 openqaworker12 worker[20516]: [debug] [pid:20516] Informing openqa-staging-1.qa.suse.de …ffline
Jul 29 14:03:53 openqaworker12 systemd[1]: Stopped openQA Worker #1.
Hint: Some lines were ellipsized, use -l to show in full.

#8 Updated by okurz 6 months ago

  • Related to action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added

#9 Updated by okurz 6 months ago

  • Related to action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retry added

#10 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

with new changes deployed https://openqa.suse.de/tests/4535923 shows now how a job can look like if a tap device is not present. The job incompletes and the reason is obvious fulfilling ACs, that is good. Why openqaworker3:15 was running at all because there should have been just 13 instances configured and running I do not know but I can continue in #69694

Also available in: Atom PDF