No monitoring data from OSD since 2024-11-25 1449Z size:M

Added by okurz 7 days ago. Updated about 8 hours ago.

In Progress
Start date:
Due date:
2024-12-12 (Due in 8 days)
Acceptance criteria

  • AC1: There is current monitoring data from OSD itself on
  • AC2: There is also monitoring data after reboots of monitor+OSD

Acceptance tests


  • Handle the IPv4+IPv6 double routing problems that appeared after setting up wireguard tunnels and that also disrupt our monitoring
  • Understand what approach to take for routing with VPN in place and consider both source and target hosts for communication
  • Might need changes to multiple hosts
  • Make changes persistent in salt
  • Ensure reboot consistency

Rollback actions

Updated by okurz 7 days ago

  • Related to action #169564: Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S added
Updated by nicksinger 7 days ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Updated by nicksinger 7 days ago

  • Status changed from In Progress to Blocked

Apparently we misunderstood enginfra and the wg-tunnel does not allow us back-and-forth communication. Even with the tunnel in place the connection has to be established from within a CC-compliant area. Given our monitoring infra pushes data to the monitoring host, this cannot work.

Updated by nicksinger 7 days ago

  • Status changed from Blocked to In Progress
Updated by nicksinger 7 days ago

  • Priority changed from Urgent to Normal

Together with Robert we found out that "asymmetric routing" is our problem. A CC-system like OSD reaches monitor via the usual network (eth0, -> but not over the wg ip of that system (in case of monitor, This outgoing connection then "punches" a hole into the firewall allowing the non-CC system to reach OSD just for the answering packet. However, the way wg sets up routes on non-CC systems (routing all CC traffic through the tunnel, because that is the whole point of the wg tunnel) causes these systems to send back their answer via the tunnel. So OSD sends out a ping via eth0 but receives the answer with a wrong source IP.

The solution is to ensure that non-CC systems always answer packets on the interface they receive them. A way to implement this is - based on that article I did the following on monitor:

echo "240     nowg" >> /etc/iproute2/rt_tables
ip route add default via dev eth0 table nowg #taken from `ip r s default`
ip rule add from table nowg prio 1

This appears to work and we have data back in grafana.

This is just a temporary workaround and will be gone after a reboot so we need to find a way to make it a) reboot safe and b) deploy it via salt on all machines.
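The three commands above could be made idempotent, so that a hook script or systemd unit can rerun them on every interface-up, along the lines of the following sketch. It is not the content of the merge request mentioned later in this ticket. The function name `policy_route`, the `RUN` dry-run switch, the numeric table id 240 used instead of the `nowg` name (which avoids touching /etc/iproute2/rt_tables), and all addresses (documentation ranges) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only, not the script from the merge request.
# RUN defaults to "echo" (dry run); export RUN="" to really execute.
RUN="${RUN-echo}"

# policy_route FAMILY IFACE GATEWAY SRC [TABLE]
# FAMILY is 4 or 6, SRC is the host's own address on IFACE.
policy_route() {
    family="$1"; iface="$2"; gateway="$3"; src="$4"; table="${5:-240}"
    # "replace" creates or overwrites the route, so reruns are harmless.
    $RUN ip "-$family" route replace default via "$gateway" dev "$iface" table "$table"
    # Drop an existing copy of the rule first to avoid duplicates.
    $RUN ip "-$family" rule del from "$src" table "$table" priority 1 2>/dev/null || true
    $RUN ip "-$family" rule add from "$src" table "$table" priority 1
}

# Dry-run examples with placeholder addresses.
policy_route 4 eth0 192.0.2.1 192.0.2.10
policy_route 6 eth0 fe80::1 2001:db8::10
```

In dry-run mode the script only prints the `ip` commands it would run, which makes it easy to review before deploying it via salt.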

Updated by openqa_review 6 days ago

  • Due date set to 2024-12-12

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz 6 days ago

  • Parent task set to #166598
Updated by okurz 6 days ago

  • Priority changed from Normal to High
Updated by jbaier_cz 6 days ago

  • Related to action #170473: not reachable from mania:2 size:S added
Updated by nicksinger 5 days ago

  • Copied to action #170494: nginx.service on monitor failed because of: "No such file or directory:calling fopen(/etc/dehydrated/certs/" added
Updated by nicksinger 5 days ago

adds a service which can be used as hook script in wicked. To add this hook script to our default interfaces I had to fix a default-grain we used in the past:

Updated by nicksinger 5 days ago

nicksinger wrote in #note-11: adds a service which can be used as hook script in wicked. To add this hook script to our default interfaces I had to fix a default-grain we used in the past:

I think made v6 work again on some machines and now causes the ping to fail this way. Fortunately the fix is similar to v4 and I will extend the other MR to cover both.

Updated by okurz 3 days ago

  • Priority changed from High to Urgent

Who would have thought that a reboot of caused issues while not all changes are persistent yet ;)
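Whether the source-based rule survived a reboot can be checked from the output of `ip rule show`. The helper below is a minimal sketch; the function name `has_source_rule`, the address and the table id are placeholders, not values from this ticket.

```shell
#!/bin/sh
# Sketch: reads `ip rule show` output on stdin and succeeds only if a
# rule of the form "<prio>: from SRC lookup TABLE" is present.
# TABLE may be a number or a name, matching what `ip rule show` prints.
has_source_rule() {
    awk -v src="$1" -v table="$2" '
        $2 == "from" && $3 == src && $4 == "lookup" && $5 == table { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

A possible use after a reboot would be `ip -4 rule show | has_source_rule 192.0.2.10 240 || echo "policy rule missing"`, with the placeholder address replaced by the host's own.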

Updated by gpuliti 2 days ago

  • Description updated (diff)
Updated by mkittler about 20 hours ago

  • Subject changed from No monitoring data from OSD since 2024-11-25 1449Z to No monitoring data from OSD since 2024-11-25 1449Z size:M
  • Description updated (diff)
Updated by nicksinger about 16 hours ago · Edited

I've added more necessary changes to but unfortunately failed right at the last step: for some reason wicked is not executing the service POST_UP. I've asked in Slack and already found out that our wicked gets "stuck" somewhere and never reaches the POST-phase (despite everything working). I might implement a workaround if needed.

Updated by nicksinger about 8 hours ago

Turns out that our BOOTPROTO was set not quite right.

The documentation states:

The setup is considered successful, when at least one dhcp client configures the interface.

But apparently POST_UP-scripts only get executed if both clients succeed. This cannot happen in our network because we don't use DHCPv6, so wicked times out and never executes my unit. I added the proper configuration for at least all wireguard clients:

This whole mess unfortunately revealed that I also need to handle IPv6 SLAAC which the script can now do with

The configuration of the MR is already persistent on monitor and survived a reboot so the MR is finally ready to be merged.
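For illustration, an ifcfg file that restricts wicked to the DHCPv4 client and hooks a unit after interface setup might look like the fragment below. This is an assumption about the shape of the fix, not the configuration from the MR: `dhcp4` is one of the documented BOOTPROTO values, the unit name `nowg-routes@.service` is hypothetical, and the `systemd:` scheme for POST_UP_SCRIPT should be verified against ifcfg(5) for the installed wicked version.

```shell
# /etc/sysconfig/network/ifcfg-eth0 (sketch; values are assumptions)
STARTMODE='auto'
# Only start the DHCPv4 client, so setup does not wait for DHCPv6.
BOOTPROTO='dhcp4'
# Hypothetical unit name; meant to run once the interface is up.
POST_UP_SCRIPT='systemd:nowg-routes@.service'
```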


