action #66376
closedMM tests fail in obscure way when tap device is not present
Description
Observation¶
MM tests fail to access other machines in the network if tap device is not present, there is a diag message reported but the test does not fail at this point.
For example
https://openqa.suse.de/tests/4189385#step/setup/42 shows "Failed to connect to 10.0.2.2 port 20243: No route to host" and fails. What is hidden in autoinst-log.txt:
[2020-05-04T08:30:16.586 CEST] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap23 13' : 'tap23' is not connected to bridge 'br1'
but does not fail on this. The worker machine instance is openqaworker-arm-3:24 which should not even have been running as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L570 shows that only 20 worker instances have been configured. The service "openqa-worker@24" is "enabled-runtime", so not "enabled", maybe pulled in manually by "openqa-worker.target".
Acceptance criteria¶
- AC1: multi-machine tests relying on a tap device should fail hard if the tap device is not present or can not be configured
Suggestions¶
- Research why Ettore chose to ignore errors in https://github.com/os-autoinst/os-autoinst/pull/828/files#diff-ce37f278fec25235d9f905442ecd8e5aR150 and just report in a diag message instead
- Try to make the test report clearly about the missing tap device
- Find out why the systemd worker instances on openqaworker-arm-3 are "enabled-runtime" and prevent any workers over the configured amount of instances to be enabled.
Updated by okurz over 4 years ago
# salt -l error --no-color -C 'G@roles:worker and G@osarch:aarch64' cmd.run "systemctl is-active openqa-worker@\* | wc -l"
openqaworker-arm-1.suse.de:
10
openqaworker-arm-3.suse.de:
26
openqaworker-arm-2.suse.de:
30
shows that openqaworker-arm-2 and openqaworker-arm-3 are currently running incorrectly.
As a workaround I have done systemctl mask --now openqa-worker@{21..30}
on openqaworker-arm-3 and ssh openqaworker-arm-2 "sudo systemctl mask --now openqa-worker@{21..30}"
on openqaworker-arm-2.
Updated by okurz over 4 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
Updated by sebchlad over 4 years ago
Naturally I'm looking at the problem as the "user" who expects that provided workers are ready for multimachine testing.
From this stand point I think:
- support sever - while setting up the priv network - is failing in rather obvious way; I mean it is rather easy to see that the problem is likely on the host side (I do not mean - it is easy to pinpoint exact host problem)
- multimachine test code starting once priv network is set-up: today morning I saw this: https://openqa.suse.de/tests/4186098#step/pdsh_master/6 This problem also is relatively easy to understand: https://openqa.suse.de/tests/4186098#step/pdsh_master/1
I could imagine adding simple module: hpc(mm?)_network_checks.pm
The module would load on all nodes and to stuff like curl to outside etc.
Checks would be fatal and would fail the test immediately.
Wdyt?
Any better ideas?
Is above not needed?
side note: this is still just prevention measure. After all by now checking mm tests for me is easy; easy in the sense of concluding - yeah, restart and ask tools for restarting a worker ;-)
so we should perhaps focus on checking what we could improve on the infrastructure/dev-ops side?
Updated by okurz over 4 years ago
sebchlad wrote:
[…]
I think all checks that are done within the SUT add a layer of complexity that should not be necessary to find out if the prerequisites are fulfilled. And from the referenced failures one can see that something with the network is wrong but I do not see it as "obvious" that it is a problem on the worker side without checking cloned jobs or previous tests. At best we have checks that are not relying on the SUT VM at all, e.g. just code that is directly executed on the worker, not within the SUT VM, like pinging the remote host over the selected network interface.
side note: this is still just prevention measure. After all by now checking mm tests for me is easy; easy in the sense of concluding - yeah, restart and ask tools for restarting a worker ;-)
so we should perhaps focus on checking what we could improve on the infrastructure/dev-ops side?
sure, that is part of the story. Next to the problems regarding the systemd services as in the ticket description I am thinking about what to do in monitoring for better checks. Also I would like more of the logic from salt to be done by upstream openQA or os-autoinst, e.g. see https://github.com/okurz/openQA/blob/feature/setup_mm/script/setup_mm
Updated by nicksinger over 4 years ago
okurz wrote:
- Find out why the systemd worker instances on openqaworker-arm-3 are "enabled-runtime" and prevent any workers over the configured amount of instances to be enabled.
Unfortunately I didn't see anything suspicious on that system regarding the enabled-runtime. According to https://www.freedesktop.org/software/systemd/man/systemctl.html#is-enabled%20UNIT%E2%80%A6 this is caused by somebody starting it with --runtime
just until the next system reboot. Maybe somebody thought we have not enough workers? I can try to dig through some logs but I fear this information might be lost.
Updated by okurz about 4 years ago
Trying to trigger test jobs that I can use to check on the errors, correct handling with fatal errors, crosscheck that everything is ok for correct devices:
I tried to schedule test that go immediately to another job group to not mess up the main dashboard.
Now unfortunately the "isos" route can take a parameter "_GROUP_ID" which behaves different than the same parameter on the "jobs" route as it restricts scheduling to tests that are defined within the specified job group. So we need to go another route:
res=$(openqa-cli api --osd -X post isos ARCH=aarch64 BUILD=okurz_poo66376 DISTRI=sle VERSION=15-SP2 FLAVOR=Online TEST="hpc_DELTA_slurm_accounting_supportserver,hpc_DELTA_slurm_master_accounting,hpc_DELTA_slurm_slave00,hpc_DELTA_slurm_slave01"); echo $res; for i in $(echo $res | jq -r -c .ids | tr '[],' ' '); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
this schedules jobs and moves them to the test development job group. I wonder why this also schedules "create_hdd_hpc_textmode".
Added to openqaworker-arm-2:/etc/openqa/workers.ini:
[40]
WORKER_CLASS=tap,poo66376
[41]
WORKER_CLASS=tap,poo66376
[42]
WORKER_CLASS=tap,poo66376
[43]
WORKER_CLASS=tap,poo66376
[44]
WORKER_CLASS=tap,poo66376
[45]
WORKER_CLASS=tap,poo66376
and started sudo systemctl start openqa-worker@{40..45}
and spawned
res=$(openqa_client_osd --json-output isos post ARCH=aarch64 BUILD=okurz_poo66376 DISTRI=sle VERSION=15-SP2 FLAVOR=Online TEST="hpc_DELTA_slurm_accounting_supportserver,hpc_DELTA_slurm_master_accounting,hpc_DELTA_slurm_slave00,hpc_DELTA_slurm_slave01" WORKER_CLASS=tap,poo66376); echo $res; for i in $(echo $res | jq -r -c .ids | tr '[],' ' '); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
with result { "count" : 5, "failed" : [], "ids" : [ 4461236, 4461237, 4461238, 4461239, 4461240 ], "scheduled_product_id" : 569965 }
so this will probably fail due to missing SLE variables or something but maybe it will also show warnings about missing tap devices in log files. Then I can reschedule with QEMU_FATAL_DBUS_CALL=1
and see if the feedback differs.
Hm. As I did not supply an ISO the test fails early. There is no error about dbus in the logs. I doubt this approach makes sense.
Maybe better to clone wicked jobs:
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
This looks better. https://openqa.suse.de/tests/4462217 is in the test development group now and https://openqa.suse.de/tests/4462217#step/before_test/17 shows that we fail to get access to the other machine as expected. https://openqa.suse.de/tests/4462216/file/autoinst-log.txt shows
[0m[37m[2020-07-17T15:21:17.183 UTC] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap42 16' : 'tap42' is not connected to bridge 'br1'
[0m[37m[2020-07-17T15:21:17.224 UTC] [debug] Failed to run dbus command 'set_vlan' with arguments 'tap106 16' : 'tap106' is not connected to bridge 'br1'
now with fatal dbus calls:
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
-> https://openqa.suse.de/t4462395
Funny enough, this fails the same way but nothing in logs.
I am just changing code on openqaworker-arm-2 now and testing. Three test cases:
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
-> https://openqa.suse.de/t4462405 -> good, fatal fail with good message
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 QEMU_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
-> https://openqa.suse.de/t4462407 -> good, passed early steps
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
-> https://openqa.suse.de/t4462409 -> good, error recorded in log but continues
Now turning this around with QEMU_NON_FATAL_DBUS_CALL, changing source on openqaworker-arm-2 and testing again:
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 WORKER_CLASS=tap,poo66376 QEMU_NON_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 QEMU_NON_FATAL_DBUS_CALL=1 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 WORKER_CLASS=tap,poo66376 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
for i in $(openqa-clone-job --within-instance https://openqa.suse.de --skip-chained-deps --parental-inheritance 4341667 | sed -n 's/Created job #\([0-9]\+\).*$/\1/p'); do openqa-cli api --osd -X put jobs/$i --data '{"group_id": 96}'; done
-> https://openqa.suse.de/tests/4462602 -> good, failed with non-fatal dbus handling
-> https://openqa.suse.de/tests/4462604 -> good, passed with non-fatal dbus handling
-> https://openqa.suse.de/tests/4462605 -> good, incomplete with good error report with fatal dbus handling
-> https://openqa.suse.de/tests/4462608 -> good, passed with fatal dbus handling
https://github.com/os-autoinst/os-autoinst/pull/1477
Stopped additional worker instances on openqaworker-arm-2 again and removed config in /etc/openqa/workers.ini
Now with for i in {1..50}; do echo -e "$i:\t$(sudo systemctl is-enabled openqa-worker@$i)"; done
I can see:
1: enabled
…
20: enabled
21: masked
…
30: masked
31: disabled
…
39: disabled
40: enabled-runtime
…
45: enabled-runtime
46: disabled
…
50: disabled
after initially I did sudo systemctl start openqa-worker@{40..45}
but it shows "enabled-runtime". I can not find any symlinks for these worker instances in neither /etc/systemd/system/ nor /run/systemd/system/ and sudo systemctl disable openqa-worker@{40..50}
as well as sudo systemctl disable --runtime openqa-worker@{40..50}
have no effect. I do not understand this enabled-runtime
. I can do sudo systemctl mask --runtime openqa-worker@{40..45}
to mask them but only in runtime but I still don't understand it. Also "status" reports something different seemingly:
sudo systemctl status openqa-worker@{39..40}
● openqa-worker@39.service - openQA Worker #39
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: inactive (dead)
● openqa-worker@40.service
Loaded: masked (/dev/null; enabled-runtime)
Active: inactive (dead) since Fri 2020-07-17 19:17:42 UTC; 13min ago
Main PID: 43089 (code=exited, status=0/SUCCESS)
so one instance is "loaded (disabled)", the other is "masked (enabled-runtime)"?!? Even sudo systemctl revert openqa-worker@{40..45}
doesn't do a difference here. I researched for all web references I could find, looked through the man pages and github.com/systemd/systemd but the only thing remotely related to this "enabled-runtime" thing I could find is https://github.com/systemd/systemd/issues/9122 which mentions a pull request https://github.com/systemd/systemd/pull/9147 which references systemd tag v239 . In Leap 15.1 and 15.2 we have v234 but in Tumbleweed we would have 245 which might behave different in this context. Maybe we can include a call to mask --runtime
in salt to prevent any superfluous worker instances to be started.
EDIT: 2020-07-21: Crosschecking behaviour on openqaworker12 and also openqaworker-arm-1 shows again differences. On openqaworker-arm-1 I did:
systemctl start openqa-worker@40; systemctl stop openqa-worker@40
but systemctl is-enabled openqa-worker@40
yields disabled
and systemctl status openqa-worker@40
shows
● openqa-worker@40.service - openQA Worker #40
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: inactive (dead)
whereas on openqaworker-arm-2 we have
● openqa-worker@40.service - openQA Worker #40
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: inactive (dead) since Fri 2020-07-17 19:17:42 UTC; 3 days ago
Main PID: 43089 (code=exited, status=0/SUCCESS)
so it differs in "disabled" vs. "enabled-runtime" and also in the case of openqaworker-arm-2 we have a Main PID: 43089 (code=exited, status=0/SUCCESS)
even though in both cases we see in the logs clearly a process that cleanly exited and a PID from logs is also available. Why does it show "Main PID" and "enabled-runtime" in one case and not in the other? And now suddenly openqaworker-arm-2 just died, see https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 , Did I do this?
Updated by okurz about 4 years ago
Testing on openqaworker12 with a local checkout of salt repo(s).
The old code gives for salt-call -l debug --local --state-output=changes state.apply openqa.worker
----------
ID: stop_and_disable_all_workers
Function: cmd.run
Name: systemctl stop openqa-worker@{1..100}; systemctl disable openqa-worker@{1..100}
Result: True
Comment: Command "systemctl stop openqa-worker@{1..100}; systemctl disable openqa-worker@{1..100}" run
Started: 13:32:08.826127
Duration: 844.103 ms
Changes:
----------
pid:
17956
retcode:
0
stderr:
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@2.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@3.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@4.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@5.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@6.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@7.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@8.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@9.service.
Removed /etc/systemd/system/multi-user.target.wants/openqa-worker@10.service.
stdout:
We should keep in mind that this is only triggered when /etc/openqa/workers.ini is changed. We do not want to restart all workers without any changes.
Seems like systemctl disable --now openqa-worker@\*
does not stop "enabled-runtime" workers:
# date; systemctl disable --now openqa-worker@\*; systemctl --no-pager status openqa-worker@1
Wed Jul 29 14:02:59 CEST 2020
● openqa-worker@1.service - openQA Worker #1
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: active (running) since Wed 2020-07-29 13:46:12 CEST; 16min ago
Main PID: 20516 (worker)
Tasks: 1
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@1.service
└─20516 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
Jul 29 13:46:13 openqaworker12 worker[20516]: - pool directory: /var/lib/openqa/pool/1
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] CACHE: caching is enabled, setting up /…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Project dir for host openqa-staging-1.q…/share
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] CACHE: caching is enabled, setting up /…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Project dir for host openqa-staging-2.q…/share
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Establishing ws connection via ws://ope…/ws/11
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [error] [pid:20516] Failed to register at openqa-staging-2…server
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registered and connected via websockets… ID 11
Hint: Some lines were ellipsized, use -l to show in full.
so what looks safer:
# date; systemctl disable --now openqa-worker@\*; systemctl stop openqa-worker@\*; systemctl --no-pager status openqa-worker@1
Wed Jul 29 14:03:53 CEST 2020
● openqa-worker@1.service - openQA Worker #1
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled-runtime; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: inactive (dead) since Wed 2020-07-29 14:03:53 CEST; 26ms ago
Process: 20516 ExecStart=/usr/share/openqa/script/worker --instance 1 (code=exited, status=0/SUCCESS)
Main PID: 20516 (code=exited, status=0/SUCCESS)
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Establishing ws connection via ws://ope…/ws/11
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registering with openQA openqa-staging-…use.de
Jul 29 13:46:14 openqaworker12 worker[20516]: [error] [pid:20516] Failed to register at openqa-staging-2…server
Jul 29 13:46:14 openqaworker12 worker[20516]: [info] [pid:20516] Registered and connected via websockets… ID 11
Jul 29 14:03:53 openqaworker12 systemd[1]: Stopping openQA Worker #1...
Jul 29 14:03:53 openqaworker12 worker[20516]: [info] [pid:20516] Quit due to signal TERM
Jul 29 14:03:53 openqaworker12 worker[20516]: [debug] [pid:20516] Informing openqa-staging-2.qa.suse.de …ffline
Jul 29 14:03:53 openqaworker12 worker[20516]: [debug] [pid:20516] Informing openqa-staging-1.qa.suse.de …ffline
Jul 29 14:03:53 openqaworker12 systemd[1]: Stopped openQA Worker #1.
Hint: Some lines were ellipsized, use -l to show in full.
Updated by okurz about 4 years ago
- Related to action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added
Updated by okurz about 4 years ago
- Related to action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retry added
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
with new changes deployed https://openqa.suse.de/tests/4535923 shows now how a job can look like if a tap device is not present. The job incompletes and the reason is obvious fulfilling ACs, that is good. Why openqaworker3:15 was running at all because there should have been just 13 instances configured and running I do not know but I can continue in #69694