action #33841: [salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1' - openQA Project (public) - openSUSE Project Management Tool

action #33841

closed

[salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'

Added by asmorodskyi over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: thehejik
Category:
Target version:
Start date: 2018-03-27
Due date:
% Done: 100%
Estimated time:

Description

Currently, ARM jobs sporadically fail with "network down" related issues. In such cases, the message from the subject appears in the logs.

Example - https://openqa.suse.de/tests/1570882/file/autoinst-log.txt

Related issues 1 (0 open — 1 closed)

Updated by asmorodskyi over 6 years ago

Related to action #32296: openvswitch salt receipe is 'unstable' added

Updated by thehejik over 6 years ago

tap33 is not available

The problem is that worker openqaworker-arm-2 is misconfigured: the pillars define "numofworkers: 30" for this worker, but 40 worker processes are actually running.

The tap creation process uses the value of numofworkers, so there are not enough tap devices for 40 workers.

For the 30 workers defined in the pillars we have the correct number of tap devices, tap0...tap29. But the host has 40 workers running (I don't know the reason yet), so it expects tap0...tap39, and tap33 is not available.

The problem might be somewhere in the salt state or in the "global:" definition in the pillars. Or potentially someone just manually started 40 worker services instead of 30.
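
The mismatch described above can be illustrated with a small shell sketch. It assumes, as stated in this comment, that tap devices are numbered from tap0 up to numofworkers minus one. The function name missing_taps is hypothetical and not part of openQA or the salt states.

```shell
# Hypothetical helper: list the tap devices that the running workers expect
# but that were never created.
# $1: numofworkers from the pillars, $2: number of worker processes running
missing_taps() {
  i=$1
  while [ "$i" -lt "$2" ]; do
    echo "tap$i"
    i=$((i + 1))
  done
}

# 30 tap devices configured, 40 workers running
missing_taps 30 40
```

With 30 configured and 40 running, the list runs from tap30 to tap39 and therefore includes tap33, the device named in the error message.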

Updated by szarate over 6 years ago

It seems that the systemd target openqa-worker.target is not properly updated.

I have updated the services and rebooted the server (for some reason, the Wants and Requires are still the same, listing more than 30 workers). Doing a daemon-reexec or daemon-reload did nothing.

If the Windows solution doesn't work, calling a technician might help :)
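
One way to check whether the target still pulls in more than the configured number of workers is to count the worker instances in its Wants list. This is a minimal sketch, assuming the target lists its instances as openqa-worker@N.service; the helper name count_wanted_workers is hypothetical.

```shell
# Hypothetical helper: count openqa-worker@N.service entries in a
# "Wants=..." line read from stdin.
count_wanted_workers() {
  # Split the property line on '=' and spaces, one unit name per line,
  # then count only the worker instances.
  tr ' =' '\n\n' | grep -c '^openqa-worker@[0-9][0-9]*\.service$'
}

# Sample input in the format printed by `systemctl show -p Wants <unit>`
printf 'Wants=openqa-worker@1.service openqa-worker@2.service network.target\n' \
  | count_wanted_workers   # prints 2

# On a worker host the input would come from systemd instead:
#   systemctl show -p Wants openqa-worker.target | count_wanted_workers
```

If the count is higher than numofworkers, the target is still referencing stale instances.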

Updated by szarate over 6 years ago

Status changed from New to Feedback

Seems to have worked... Now let's wait.

Updated by thehejik over 6 years ago

Subject changed from Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1' to [salt] Failed to run dbus command 'set_vlan' with arguments 'tap33 64' : 'tap33' is not connected to bridge 'br1'

More info:
Originally the worker openqaworker-arm-2 had 40 workers, but coolo changed it to 30 in https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/5d391f08c3e460ae1cc035481ac73be198c44740

The problem is in the openqa/worker.sls salt state at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L171, where we "watch" for changes in /etc/openqa/workers.ini. However, this file stays untouched when the worker uses the "global:" settings from the pillar data and only the value of "numofworkers" has changed.

So the salt state openqa-worker@{{ i }} has to be changed to watch something other than the workers.ini file.

The problem is not related to poo#32296; in this case openvswitch.sls behaves correctly (tap39 -> tap29 for 30 workers).

Updated by thehejik over 6 years ago

A better subject for this poo would be "[salt] Reflect changes in numofworkers for globally configured workers".

Updated by thehejik over 6 years ago

Status changed from Feedback to In Progress
Assignee set to thehejik
% Done changed from 0 to 90

The fix is waiting for merge: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/39

Updated by szarate over 6 years ago

PR merged :)

Updated by thehejik over 6 years ago

Status changed from In Progress to Resolved
% Done changed from 90 to 100

#10

Updated by thehejik over 6 years ago

Status changed from Resolved to In Progress

Unfortunately the problem is not solved.

The problem is that when we lower the number of workers, the loop at https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/worker.sls#L172 does not stop workers that were started earlier with a higher "@number" than the value currently stored in numofworkers. Those services remain enabled and running.

E.g.:

numofworkers=20 -> we have 20x openqa-worker@{1..20}
change numofworkers to 10
we still have 20x openqa-worker@{1..20} running
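
The leftover instances in this example can be picked out with a short shell sketch. It assumes the running instance numbers are supplied one per line (for instance extracted from the output of systemctl list-units 'openqa-worker@*'); the function name stale_workers is hypothetical.

```shell
# Hypothetical helper: print worker units whose instance number is higher
# than numofworkers.
# $1: numofworkers; stdin: one running instance number per line
stale_workers() {
  while read -r n; do
    if [ "$n" -gt "$1" ]; then
      echo "openqa-worker@$n"
    fi
  done
}

# Instances 9..12 are running, numofworkers was lowered to 10
printf '%s\n' 9 10 11 12 | stale_workers 10
```

Here only openqa-worker@11 and openqa-worker@12 are reported; these are the units that the loop in worker.sls never touches.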

This is the current code in worker.sls:

# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
    - require:
      - pkg: worker-openqa.packages
    - watch:
      - file: /etc/openqa/workers.ini
{% endfor %}

The simplest solution would be to disable and stop all openqa-worker@* services and then start the correct number of workers again.

Something like:

# start services based on numofworkers set in workerconf pillar
{% for i in range(pillar['workerconf'][grains['host']]['numofworkers']) %}
{% set i = i+1 %}
openqa-worker@{{ i }}:
  service.running:
    - enable: True
{% if loop.first %}
    - require:
      - disable_all_workers
    - watch:
      - file: /etc/openqa/workers.ini
{% endif %}
{% endfor %}

disable_all_workers:
  service.dead:
    - name: openqa-worker@*
    - enable: False

Be warned that the code above might have some side effects.

#11

Updated by thehejik over 6 years ago

% Done changed from 100 to 90

Fix for the problem described above: https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/40

#12

Updated by okurz over 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_mrsh_supportserver
https://openqa.suse.de/tests/1630114

#13

Updated by thehejik over 6 years ago

Status changed from In Progress to Resolved
% Done changed from 90 to 100

merged

#14

Updated by okurz over 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: hpc_ganglia_supportserver
https://openqa.suse.de/tests/1675894

#15

Updated by thehejik over 6 years ago

Status changed from Resolved to In Progress

#16

Updated by thehejik over 6 years ago

Hopefully I fixed it by running the following commands on workers oqw-arm-1 and oqw-arm-2:
systemctl stop openqa-worker.target
systemctl disable openqa-worker.target
systemctl stop openqa-worker@{1..100}
systemctl disable openqa-worker@{1..100}
systemctl enable openqa-worker@{1..NUMOFWORKERS}
systemctl start openqa-worker@{1..NUMOFWORKERS}

Let's see.
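
The commands above use {1..NUMOFWORKERS} as a placeholder. Bash does not expand variables inside a brace range, so a scripted version needs a loop. The following is only a sketch of the same sequence: the function name reset_workers and the DRY_RUN switch are hypothetical, and by default it just prints the systemctl calls instead of running them.

```shell
# Hypothetical helper: replay the manual cleanup for a given numofworkers.
# $1: numofworkers, $2: highest instance number to stop/disable (default 100)
# Prints the commands unless DRY_RUN=0 is set in the environment.
reset_workers() {
  numofworkers=$1
  max=${2:-100}
  if [ "${DRY_RUN:-1}" = 1 ]; then run=echo; else run=; fi

  $run systemctl stop openqa-worker.target
  $run systemctl disable openqa-worker.target

  # Stop and disable every instance that might have been started earlier.
  i=1
  while [ "$i" -le "$max" ]; do
    $run systemctl stop "openqa-worker@$i"
    $run systemctl disable "openqa-worker@$i"
    i=$((i + 1))
  done

  # Bring back only the configured number of workers.
  i=1
  while [ "$i" -le "$numofworkers" ]; do
    $run systemctl enable "openqa-worker@$i"
    $run systemctl start "openqa-worker@$i"
    i=$((i + 1))
  done
}

# Dry run: 2 workers wanted, instances 1..3 cleaned up first
reset_workers 2 3
```

Running it with DRY_RUN=0 as root would execute the commands for real, so the printed list is worth reviewing first.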

#17

Updated by thehejik over 6 years ago

Status changed from In Progress to Resolved

No sign of the problem anymore, closing as resolved. Please reopen if it happens again.
