action #75055

grenache-1 can't connect to webui's over IPv4 only

Added by nicksinger 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2020-10-22
Due date:
% Done: 0%
Estimated time:

Description

Due to the ongoing IPv6 problems we realized that grenache-1 workers disappear one by one when there is no working IPv6 connectivity. This currently results in many blocked jobs, since grenache-1 is our main jump host for more exotic testing environments. This ticket mainly tracks what I did to make the workers appear in OSD again:

  • 20.10.2020: the problem was noticed because workers could not connect to baremetal-support.qa.suse.de
  • 20.10.2020: the IPv6 route was missing, created https://infra.nue.suse.com/SelfService/Display.html?id=178626
  • 20.10.2020: the IPv6 route was added manually with ip -6 r a fe80::1; after that the worker appeared on all web UIs again
  • 21.10.2020: Due to severe performance problems with the workers we decided to remove the v6 route again (details: https://progress.opensuse.org/issues/73633?issue_count=67&issue_position=1&next_issue_id=73501#note-2)
  • 22.10.2020: Several reports stated that grenache-1 workers are once again unavailable. Things I did:
    • Stopped all openqa-worker instances
    • Unmounted /var/lib/openqa/share (umount /var/lib/openqa/share), since it was mounted over IPv6
    • Disabled IPv6 completely on the external interface with echo 1 > /proc/sys/net/ipv6/conf/eth0/disable_ipv6
    • Remounted the share and restarted the workers with mount /var/lib/openqa/share && systemctl start openqa-worker@{1..40}
    • ==> workers came back on OSD. First jobs are running. Reducing priority for now :)
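
The steps from 22.10.2020 could be collected into a small script, roughly as sketched below. This is only a sketch under assumptions, not what was actually run: the `systemctl stop` call is a guess (the ticket does not say how the workers were stopped), `sysctl -w` stands in for the `echo 1 > /proc/...` write, and the `run` helper, the `disable_v6_workaround` function name and the `DRY_RUN`, `IFACE` and `SHARE` variables are made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the 22.10.2020 workaround. Prints the commands unless DRY_RUN=0.
# Needs root when run for real.
DRY_RUN="${DRY_RUN:-1}"
IFACE="${IFACE:-eth0}"                    # external interface named in the ticket
SHARE="${SHARE:-/var/lib/openqa/share}"   # share mount point named in the ticket

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_v6_workaround() {
    # Build the unit list openqa-worker@1 .. openqa-worker@40
    units=""
    i=1
    while [ "$i" -le 40 ]; do
        units="$units openqa-worker@$i"
        i=$((i + 1))
    done

    # Assumption: the ticket does not state how the workers were stopped.
    # $units is left unquoted on purpose so it splits into separate arguments.
    run systemctl stop $units
    run umount "$SHARE"
    # Same effect as: echo 1 > /proc/sys/net/ipv6/conf/$IFACE/disable_ipv6
    run sysctl -w "net.ipv6.conf.${IFACE}.disable_ipv6=1"
    run mount "$SHARE"
    run systemctl start $units
}
```

Calling `disable_v6_workaround` after sourcing the file prints the five commands for review; setting `DRY_RUN=0` beforehand executes them. Note that a setting made this way does not survive a reboot.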

Related issues

Related to openQA Infrastructure - action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) · Resolved · 2020-10-20 · 2020-11-17

Related to openQA Infrastructure - action #75031: [Worker][IPMI] Two openQA workers become offline. openQA jobs stopped running. · Resolved · 2020-10-21

History

#1 Updated by nicksinger 12 months ago

  • Description updated (diff)

#2 Updated by nicksinger 12 months ago

  • Related to action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) added

#3 Updated by nicksinger 12 months ago

  • Description updated (diff)
  • Priority changed from Urgent to Normal

#4 Updated by okurz 12 months ago

  • Assignee set to nicksinger
  • Target version set to Ready

with that I guess you can also set the ticket to "Blocked" waiting for EngInfra, isn't it?

#5 Updated by nicksinger 12 months ago

  • Related to action #75031: [Worker][IPMI] Two openQA workers become offline. openQA jobs stopped running. added

#6 Updated by nicksinger 12 months ago

  • Status changed from Feedback to Blocked

okurz wrote:

with that I guess you can also set the ticket to "Blocked" waiting for EngInfra, isn't it?

I wanted to wait for some feedback on the performance, but blocking it is fine too. If performance issues get reported I can set it to "workable" again anyway.

#7 Updated by okurz 11 months ago

  • Status changed from Blocked to Resolved

https://infra.nue.suse.com/SelfService/Display.html?id=178626 is "Resolved", as is #73633. I ran a quick ssh malbec 'ping -c 1 -4 openqa.suse.de && ping -c 1 -6 openqa.suse.de', which was successful. This should be good as well.
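
For repeated checks, the one-off ping above could be wrapped in a small function that reports both address families separately. This is a sketch only: `check_dual_stack` and the overridable `PING` variable are names invented here, and it assumes a `ping` that accepts `-4` and `-6` as in the command above.

```shell
#!/bin/sh
# Report IPv4 and IPv6 reachability of a host separately.
PING="${PING:-ping}"   # overridable, e.g. to test without network access

check_dual_stack() {
    host="$1"
    rc=0
    for family in 4 6; do
        # One echo request per address family, output suppressed
        if $PING -c 1 "-$family" "$host" >/dev/null 2>&1; then
            echo "IPv$family ok: $host"
        else
            echo "IPv$family FAILED: $host"
            rc=1
        fi
    done
    # Non-zero if either family failed
    return "$rc"
}
```

It would be called as `check_dual_stack openqa.suse.de` on the worker host. Unlike the `&&` chain, it still tests IPv6 when IPv4 fails, so the output shows which family is broken.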
