action #33841

[salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'

Added by asmorodskyi over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2018-03-27
Due date:
% Done: 100%
Estimated time:
Difficulty:

Description

Currently ARM jobs sporadically fail with "network down" related issues. In such cases the message from the subject can be found in the logs.

Example - https://openqa.suse.de/tests/1570882/file/autoinst-log.txt


Related issues

Related to openQA Infrastructure - action #32296: openvswitch salt receipe is 'unstable' (Resolved, 2018-02-26)

History

#1 Updated by asmorodskyi over 3 years ago

  • Related to action #32296: openvswitch salt receipe is 'unstable' added

#2 Updated by thehejik over 3 years ago

tap33 is not available

The problem is that the worker openqaworker-arm-2 is misconfigured: the pillars define "numofworkers: 30" for it, but 40 worker processes are actually running.

The tap creation process uses the value of numofworkers, so there are not enough tap devices for 40 workers.

For the 30 workers defined in the pillars we have the correct number of tap devices, tap0...tap29. But the host has 40 workers running (I don't know the reason yet), so it assumes tap0...tap39, and tap33 is then not available.
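
The mismatch can be illustrated with a small Python sketch. It is only an illustration of the arithmetic described above, not code from the salt states; the function names tap_devices and missing_taps are made up for this example.

```python
from typing import List


def tap_devices(count: int) -> List[str]:
    """Names of the tap devices for `count` workers: tap0 .. tap(count-1)."""
    return [f"tap{i}" for i in range(count)]


def missing_taps(configured: int, running: int) -> List[str]:
    """Tap devices the running workers assume but that were never created.

    `configured` is the numofworkers value from the pillars,
    `running` is the number of worker processes actually running.
    """
    available = set(tap_devices(configured))
    return [tap for tap in tap_devices(running) if tap not in available]


# 30 workers configured, 40 running: tap30 .. tap39 are assumed but missing.
missing = missing_taps(configured=30, running=40)
print(missing)
```

With the numbers from this ticket, tap33 is one of the ten devices that the extra workers assume but that do not exist.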

The problem might be somewhere in the salt state or in the global: definition in the pillars. Or possibly someone just manually started 40 worker services instead of 30.

#3 Updated by szarate over 3 years ago

It seems that the systemd target openqa-worker.target is not properly updated.

I have updated the services and rebooted the server (for some reason the Wants and Requires are still the same, listing more than 30 workers)... doing a daemon-reexec or daemon-reload did nothing.

If the Windows solution doesn't work, calling a technician might help :)

#4 Updated by szarate over 3 years ago

  • Status changed from New to Feedback

Seems to have worked... Now let's wait.

#5 Updated by thehejik over 3 years ago

  • Subject changed from Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1' to [salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'

More info:
Originally the worker openqaworker-arm-2 had 40 workers, but coolo changed it to 30 in https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/5d391f08c3e460ae1cc035481ac73be198c44740

The problem is in the openqa/worker.sls salt state at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L171, where we "watch" for changes in /etc/openqa/workers.ini. This file stays untouched when the worker uses the "global:" settings from the pillar data and only the value of "numofworkers" has changed.

So the salt state openqa-worker@{{ i }} has to be changed to watch something other than the workers.ini file.

The problem is not related to poo#32296; in this case openvswitch.sls behaves correctly (tap39 -> tap29 for 30 workers).

#6 Updated by thehejik over 3 years ago

A better subject for this poo would be "[salt] Reflect changes in numofworkers for globally configured workers"

#7 Updated by thehejik over 3 years ago

  • Status changed from Feedback to In Progress
  • Assignee set to thehejik
  • % Done changed from 0 to 90

#8 Updated by szarate over 3 years ago

PR merged :)

#9 Updated by thehejik over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

#10 Updated by thehejik over 3 years ago

  • Status changed from Resolved to In Progress

Unfortunately the problem is not solved.

The problem is that when we lower the number of workers, the loop at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L172 does not stop workers that were started earlier with a higher "@number" than the value currently stored in numofworkers. Those services remain enabled and running.

E.g.:

  • numofworkers=20 -> we have 20x openqa-worker@{1..20}
  • change numofworkers to 10
  • we still have 20x openqa-worker@{1..20} running
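
The example above boils down to a simple comparison, sketched here in Python. This is a hypothetical helper for illustration only (the name stale_instances is not part of the salt states): every running instance number greater than numofworkers is left over and would have to be stopped and disabled.

```python
from typing import Iterable, List


def stale_instances(running: Iterable[int], numofworkers: int) -> List[int]:
    """Instance numbers of openqa-worker@N units that are still running
    although N is higher than the current numofworkers value."""
    return sorted(i for i in running if i > numofworkers)


# numofworkers was 20, so openqa-worker@{1..20} are running;
# after lowering numofworkers to 10, instances 11..20 are left over.
leftover = stale_instances(range(1, 21), numofworkers=10)
print([f"openqa-worker@{i}" for i in leftover])
```

A loop over range(numofworkers) only ever touches instances 1..numofworkers, so the leftover instances are never visited by the state.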

This is the current code in worker.sls:

# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
    - require:
      - pkg: worker-openqa.packages
    - watch:
      - file: /etc/openqa/workers.ini
{% endfor %}

The simplest solution would be to disable and stop all openqa-worker@* services and then start the correct number of workers again.

Something like:

# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
{% if loop.first %}
    - require:
      - disable_all_workers
    - watch:
      - file: /etc/openqa/workers.ini
{% endif %}
{% endfor %}

disable_all_workers:
  service.dead:
    - name: openqa-worker@*
    - enable: False

Be warned that the code above might have some side effects.

#11 Updated by thehejik over 3 years ago

  • % Done changed from 100 to 90

#12 Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_mrsh_supportserver
https://openqa.suse.de/tests/1630114

#13 Updated by thehejik over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

merged

#14 Updated by okurz over 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_ganglia_supportserver
https://openqa.suse.de/tests/1675894

#15 Updated by thehejik over 3 years ago

  • Status changed from Resolved to In Progress

#16 Updated by thehejik over 3 years ago

Hopefully I fixed it by running the following commands on the workers oqw-arm-1 and oqw-arm-2:
systemctl stop openqa-worker.target
systemctl disable openqa-worker.target
systemctl stop openqa-worker@{1..100}
systemctl disable openqa-worker@{1..100}
systemctl enable openqa-worker@{1..NUMOFWORKERS}
systemctl start openqa-worker@{1..NUMOFWORKERS}

Let's see.
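
NUMOFWORKERS in the commands above is a placeholder. Note that bash does not expand a variable inside a brace range, so something like {1..$N} would not work. As a rough sketch (not the procedure that was actually scripted), the same command sequence could be generated in Python for a given number of workers. The function names and the upper bound of 100 instances are assumptions taken from the commands above, and the sketch only prints the commands instead of running them.

```python
from typing import List


def worker_units(first: int, last: int) -> List[str]:
    """Unit names openqa-worker@first .. openqa-worker@last (inclusive)."""
    return [f"openqa-worker@{i}" for i in range(first, last + 1)]


def reset_commands(numofworkers: int, max_instances: int = 100) -> List[List[str]]:
    """systemctl invocations mirroring the manual cleanup steps above."""
    all_units = worker_units(1, max_instances)
    wanted = worker_units(1, numofworkers)
    return [
        ["systemctl", "stop", "openqa-worker.target"],
        ["systemctl", "disable", "openqa-worker.target"],
        ["systemctl", "stop", *all_units],
        ["systemctl", "disable", *all_units],
        ["systemctl", "enable", *wanted],
        ["systemctl", "start", *wanted],
    ]


if __name__ == "__main__":
    # Dry run: print the commands for 30 workers instead of executing them.
    for cmd in reset_commands(numofworkers=30):
        print(" ".join(cmd))
```

Printing first makes it possible to review the exact unit list before anything is stopped on a production worker.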

#17 Updated by thehejik over 3 years ago

  • Status changed from In Progress to Resolved

No sign of the problem anymore, closing as resolved. Please reopen if it happens again.
