action #33841
[salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'
Status: closed
100% done
Description
Currently ARM jobs sporadically fail with "network down" related issues. In such cases you can find the message from the subject in the logs.
Example - https://openqa.suse.de/tests/1570882/file/autoinst-log.txt
Updated by asmorodskyi over 6 years ago
- Related to action #32296: openvswitch salt receipe is 'unstable' added
Updated by thehejik over 6 years ago
tap33 is not available
The problem is that the worker openqaworker-arm-2 is misconfigured: in pillars we have "numofworkers: 30" defined for this worker, but there are actually 40 worker processes running.
The tap creation process uses the value from numofworkers, so we don't have enough tap devices for 40 workers.
For the 30 workers defined in pillars we have the correct number of tap devices, tap0...tap29. But the host is running 40 workers (I don't know the reason for that yet), so it expects tap0...tap39 and tap33 is then not available.
The problem might be somewhere in the salt state or in the global: definition in pillars. Or potentially someone just started 40 worker services instead of 30 manually.
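For illustration, the dependency works roughly as in the following sketch; this is only a hypothetical example (the tapNN names, br1 and the pillar path come from this ticket, everything else is assumed), not the actual openvswitch.sls:

# Hypothetical sketch, not the real openvswitch.sls: create one tap device per
# configured worker and attach it to br1. If more worker services are running
# than numofworkers allows for, the extra workers end up without a tap device
# (tap33 in this case).
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
tap{{ i }}:
  cmd.run:
    - name: ip tuntap add dev tap{{ i }} mode tap && ovs-vsctl --may-exist add-port br1 tap{{ i }}
    - unless: ovs-vsctl list-ports br1 | grep -qx tap{{ i }}
{% endfor %}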
Updated by szarate over 6 years ago
It seems that the systemd openqa-worker.target is not properly updated.
I have updated the services and rebooted the server (for some reason, the Wants and Requires are still the same, above 30 workers)... doing a daemon-reexec or reload did nothing.
If the Windows solution doesn't work, calling a technician might help :)
Updated by szarate over 6 years ago
- Status changed from New to Feedback
Seems to have worked... now let's wait.
Updated by thehejik over 6 years ago
- Subject changed from Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1' to [salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'
More info:
originally the worker openqaworker-arm-2 had 40 workers, but coolo changed it to 30 by https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/5d391f08c3e460ae1cc035481ac73be198c44740
The problem is in the openqa/worker.sls salt state at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L171, where we "watch" for changes in /etc/openqa/workers.ini; but this file stays untouched when the worker uses the "global:" settings from pillar data and only the numofworkers value changes.
So the salt state openqa-worker@{{ i }} has to be changed to watch something other than the workers.ini file.
The problem is not related to poo#32296; in this case openvswitch.sls behaves correctly (tap39 -> tap29 for 30 workers).
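One conceivable way to do that (a sketch only, with a hypothetical marker file name; not taken from the actual fix) would be to render numofworkers into a salt-managed file and let the services watch that file instead of workers.ini:

# Hypothetical sketch: write the pillar value into a marker file so that a
# change of numofworkers produces a file change the worker services can watch.
/etc/openqa/numofworkers.marker:
  file.managed:
    - contents: "{{ pillar['workerconf'][grains['host']]['numofworkers'] }}"

{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
    - watch:
      - file: /etc/openqa/numofworkers.marker
{% endfor %}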
Updated by thehejik over 6 years ago
A better subject for this poo would be "[salt] Reflect changes in numofworkers for globally configured workers".
Updated by thehejik over 6 years ago
- Status changed from Feedback to In Progress
- Assignee set to thehejik
- % Done changed from 0 to 90
Fix waiting for merge https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/39
Updated by thehejik over 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
Updated by thehejik over 6 years ago
- Status changed from Resolved to In Progress
Unfortunately the problem is not solved.
The problem is that when we lower the number of workers, the loop at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L172 will not stop workers that were started earlier with a higher "@number" than the value currently stored in numofworkers. Those services remain enabled and running.
E.g.:
- numofworkers=20 -> we have 20x openqa-worker@{1..20}
- change numofworkers to 10
- we still have 20x openqa-worker@{1..20} running
This is the current code in worker.sls:
# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
    - require:
      - pkg: worker-openqa.packages
    - watch:
      - file: /etc/openqa/workers.ini
{% endfor %}
The simplest solution would be to disable and stop all openqa-worker@* instances and then start the correct number of workers again.
Something like:
# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
{% if loop.first %}
    - require:
      - disable_all_workers
    - watch:
      - file: /etc/openqa/workers.ini
{% endif %}
{% endfor %}

disable_all_workers:
  service.dead:
    - name: openqa-worker@*
    - enable: False
Be warned that the code above might have side effects.
Updated by thehejik over 6 years ago
- % Done changed from 100 to 90
Fix for the problem described above: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/40
Updated by okurz over 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: hpc_mrsh_supportserver
https://openqa.suse.de/tests/1630114
Updated by thehejik over 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
merged
Updated by okurz over 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: hpc_ganglia_supportserver
https://openqa.suse.de/tests/1675894
Updated by thehejik over 6 years ago
- Status changed from Resolved to In Progress
Updated by thehejik over 6 years ago
Hopefully I fixed it by performing the following commands on the workers oqw-arm-1 and oqw-arm-2:
systemctl stop openqa-worker.target
systemctl disable openqa-worker.target
systemctl stop openqa-worker@{1..100}
systemctl disable openqa-worker@{1..100}
systemctl enable openqa-worker@{1..NUMOFWORKERS}
systemctl start openqa-worker@{1..NUMOFWORKERS}
Let's see.
Updated by thehejik over 6 years ago
- Status changed from In Progress to Resolved
No sign of the problem anymore, closing as resolved. Please reopen in case it happens again.