action #170338 (closed)

No monitoring data from OSD since 2024-11-25 1449Z size:M

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: nicksinger
Category: Regressions/Crashes
Start date: 2024-11-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2024-11-25T14:15:24.661Z&to=2024-11-25T14:58:43.878Z&var-host_disks=$__all&refresh=15m

Acceptance criteria

  • AC1: There is current monitoring data from OSD itself on monitor.qa.suse.de
  • AC2: There is also monitoring data after reboots of monitor+OSD

Acceptance tests

Suggestions

  • Handle IPv4+IPv6 double-routing problems after setting up wireguard tunnels, which also disrupt our monitoring
  • Understand which routing approach to take with the VPN in place, considering both source and target hosts of the communication
  • Might need changes to multiple hosts
  • Make changes persistent in salt
  • Ensure reboot consistency

Rollback actions


Related issues 5 (0 open, 5 closed)

  • Related to openQA Infrastructure (public) - action #169564: Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S (Resolved, mkittler)
  • Related to openQA Infrastructure (public) - action #170473: k2.qe.suse.de not reachable from mania:2 size:S (Resolved, ph03nix, 2024-11-28)
  • Related to openQA Infrastructure (public) - action #174550: grafana silence linking to #164853 but alert is about diesel? (Resolved, gpathak, 2024-12-18)
  • Copied to openQA Infrastructure (public) - action #170494: nginx.service on monitor failed because of: "No such file or directory:calling fopen(/etc/dehydrated/certs/loki.qa.suse.de/fullchain.pem" (Resolved, nicksinger)
  • Copied to openQA Infrastructure (public) - action #173824: Failed configure-source-based-routing@br0.service on qamaster "Error: ipv4: FIB table does not exist." size:S (Resolved, nicksinger, 2024-11-27)

Actions #1

Updated by okurz 3 months ago

  • Related to action #169564: Configure wireguard tunnels on OSD production hosts needed for openQA located in the NUE2 server room size:S added
Actions #2

Updated by nicksinger 3 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Blocked

https://sd.suse.com/servicedesk/customer/portal/1/SD-174272

Apparently we misunderstood enginfra and the wg tunnel does not allow us back-and-forth communication. Even with the tunnel in place, the connection has to be established from within a CC-compliant area. Given that our monitoring infra pushes data to the monitoring host, this cannot work.

Actions #4

Updated by nicksinger 3 months ago

  • Status changed from Blocked to In Progress
Actions #5

Updated by nicksinger 3 months ago

  • Priority changed from Urgent to Normal

Together with Robert we found out that "asymmetric routing" is our problem. A CC-system like OSD reaches monitor via the usual network (eth0, 10.145.10.207/24 -> 10.168.192.191) but not over the wg IP of that system (in the case of monitor, 10.144.169.6/32). This outgoing connection then "punches" a hole into the firewall, allowing the non-CC system to reach OSD just for the answering packet. However, because of how wg sets up routes on non-CC systems (routing all CC traffic through the tunnel, which is the whole point of the wg tunnel), these systems send their answer back via the tunnel. So OSD sends out a ping via eth0 but receives the answer with a wrong source IP.

The solution is to ensure that non-CC systems always answer packets on the interface they receive them. A way to implement this is https://unix.stackexchange.com/a/23345 - based on that article I did the following on monitor:

echo "240     nowg" >> /etc/iproute2/rt_tables
ip route add default via 10.168.195.254 dev eth0 table nowg #taken from `ip r s default`
ip rule add from 10.168.192.191/22 table nowg prio 1

This appears to work and we have data back in grafana: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2024-11-27%2018:03:13&to=now&var-host_disks=$__all&refresh=15m

This is just a temporary workaround and will be gone after a reboot so we need to find a way to make it a) reboot safe and b) deploy it via salt on all machines.
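
For illustration, a rough sketch of how this workaround could be packaged so it survives reboots, e.g. as a templated systemd oneshot unit per interface. The unit name matches the one later referenced in the copied ticket #173824, but the unit content, the script path and the address detection below are assumptions, not the actual implementation that landed in salt:

# /etc/systemd/system/configure-source-based-routing@.service (sketch)
[Unit]
Description=Source-based routing for %i so replies leave via the receiving interface
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/configure-source-based-routing %i

[Install]
WantedBy=multi-user.target

The hypothetical helper script would simply re-apply the manual commands from above for the given interface:

#!/bin/sh
# /usr/local/bin/configure-source-based-routing (sketch)
IFACE="$1"
# register the extra routing table once
grep -q 'nowg$' /etc/iproute2/rt_tables || echo "240     nowg" >> /etc/iproute2/rt_tables
# default gateway and address/prefix of the interface, as seen in `ip r s default` / `ip a`
GW=$(ip -4 route show default dev "$IFACE" | awk '{print $3; exit}')
SRC=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4; exit}')
ip route replace default via "$GW" dev "$IFACE" table nowg
ip rule add from "$SRC" table nowg prio 1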

Actions #6

Updated by openqa_review 3 months ago

  • Due date set to 2024-12-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by okurz 3 months ago

  • Parent task set to #166598
Actions #8

Updated by okurz 3 months ago

  • Priority changed from Normal to High
Actions #9

Updated by jbaier_cz 3 months ago

  • Related to action #170473: k2.qe.suse.de not reachable from mania:2 size:S added
Actions #10

Updated by nicksinger 3 months ago

  • Copied to action #170494: nginx.service on monitor failed because of: "No such file or directory:calling fopen(/etc/dehydrated/certs/loki.qa.suse.de/fullchain.pem" added
Actions #11

Updated by nicksinger 3 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316 adds a service which can be used as a hook script in wicked. To add this hook script to our default interfaces, I had to fix a default grain we used in the past: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1319

Actions #12

Updated by nicksinger 3 months ago

nicksinger wrote in #note-11:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316 adds a service which can be used as hook script in wicked. To add this hook script to our default interfaces I had to fix a default-grain we used in the past: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1319

I think https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1319 made v6 work again on some machines and that now causes the ping to fail this way. Fortunately the fix is similar to the v4 one, and I will extend the other MR to cover both.
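
For the IPv6 side, the counterpart to the IPv4 workaround from #note-5 would presumably look like the following (the gateway below is a placeholder; the /64 prefix is the one later visible in the rule listing in #note-20):

# the nowg table is already registered in /etc/iproute2/rt_tables by the IPv4 setup
ip -6 route add default via fe80::1 dev eth0 table nowg   # placeholder gateway, take the real one from `ip -6 r s default`
ip -6 rule add from 2a07:de40:a102:5::/64 table nowg prio 1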

Actions #13

Updated by okurz 3 months ago

  • Priority changed from High to Urgent

Who would have thought that a reboot of monitor.qe.nue2.suse.org would cause issues while not all changes are made persistent yet ;) See https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2024-11-30T18:08:34.881Z&to=2024-12-01T06:24:45.668Z&var-host_disks=$__all&refresh=15m

Actions #14

Updated by gpuliti 3 months ago

  • Description updated (diff)
Actions #15

Updated by mkittler 3 months ago

  • Subject changed from No monitoring data from OSD since 2024-11-25 1449Z to No monitoring data from OSD since 2024-11-25 1449Z size:M
  • Description updated (diff)
Actions #16

Updated by nicksinger 3 months ago · Edited

I've added more necessary changes to https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316/diffs?commit_id=7d5d499a5fd0b26d9d336fbab8e613909e3f881d but unfortunately failed right at the last step: for some reason wicked is not executing the service in POST_UP. I've asked in Slack and found out that our wicked gets "stuck" somewhere and never reaches the POST phase (despite everything working). I might implement a workaround if needed.

Actions #17

Updated by nicksinger 3 months ago

Turns out that our BOOTPROTO was not set quite right.

The documentation states:

The setup is considered successful, when at least one dhcp client configures the interface.

But apparently POST_UP scripts only get executed if both clients succeed. This cannot be the case in our network because we don't use DHCPv6, so wicked times out and never executes my unit. I added the proper configuration for at least all wireguard clients: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316/diffs?commit_id=a595b66ce48b839cbff4f45ca4fa82106d810ca2#5559975917ec39e2f64d2504d12e98d130bd6db8_29_38
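
For reference, a sketch of what that ifcfg change amounts to on a wicked-managed SUSE host; file and interface name are illustrative, the authoritative change is in the linked MR:

# /etc/sysconfig/network/ifcfg-eth0 (sketch)
# BOOTPROTO='dhcp' makes wicked wait for both DHCPv4 and DHCPv6; with no DHCPv6
# server in the network it times out and POST_UP hooks never run. Restricting
# the interface to DHCPv4 lets the setup finish and the hook script execute.
BOOTPROTO='dhcp4'
STARTMODE='auto'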

This whole mess unfortunately revealed that I also need to handle IPv6 SLAAC, which the script can now do with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316/diffs#8d0aa9912df7ee335387b755578ebd6d872c0f7f_0_10

The configuration from the MR is already persistent on monitor and survived a reboot, so the MR is finally ready to be merged.

Actions #18

Updated by okurz 3 months ago

  • Description updated (diff)

Multiple related failed systemd service alerts, see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=2024-12-03T18:04:02.089Z&to=2024-12-04T11:38:01.897Z
I added a silence and a corresponding rollback action.

Actions #20

Updated by nicksinger 3 months ago

  • Priority changed from Urgent to Normal

okurz wrote in #note-19:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1316/ is merged.

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=now-7d&to=now&var-host_disks=$__all&refresh=15m&viewPanel=panel-78 looks good. What else is necessary here?

Right, I restarted the affected machines and checked with `ip -4 rule show && echo --- && ip -6 rule show` whether the nowg rules show up:

monitor.qe.nue2.suse.org:
    0:  from all lookup local
    1:  from 10.168.192.191/22 lookup nowg
    32766:  from all lookup main
    32767:  from all lookup default
    ---
    0:  from all lookup local
    1:  from 2a07:de40:a102:5:5054:ff:fe00:894e/64 lookup nowg
    32766:  from all lookup main

This makes monitor reboot-stable, so I am reducing the prio and can execute all rollback steps. I will also monitor other affected workers (e.g. sapworker1, petrol, diesel, mania, openqa-piworker) a little more closely than usual (e.g. messages in #eng-testing).
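
To spot-check those hosts in one go, something along these lines could be run from the salt master (minion list and domains below are illustrative):

salt -L 'sapworker1.qe.nue2.suse.org,petrol.qe.nue2.suse.org,diesel.qe.nue2.suse.org' \
    cmd.run 'ip -4 rule show; echo ---; ip -6 rule show'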

Actions #21

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Resolved

The whole network got changed once again; no clue whether my change is effective or not, but we see monitoring data again.

Actions #22

Updated by okurz 3 months ago

  • Due date deleted (2024-12-12)
Actions #23

Updated by okurz 3 months ago

  • Copied to action #173824: Failed configure-source-based-routing@br0.service on qamaster "Error: ipv4: FIB table does not exist." size:S added
Actions #24

Updated by okurz 3 months ago

  • Status changed from Resolved to Workable

Rollback actions are not done yet; multiple failed systemd services are still reported, so please make sure those are covered and referenced accordingly on https://stats.openqa-monitor.qa.suse.de/alerting/silences

Actions #25

Updated by nicksinger 3 months ago

  • Status changed from Workable to Resolved

Right, I had to fix quite some things on netboot.qe.prg2.suse.org and baremetal-support.qe.prg2.suse.org for these, but eventually managed to resolve all of them. Some loki fixes were also needed:

Actions #26

Updated by okurz 3 months ago

  • Related to action #174550: grafana silence linking to #164853 but alert is about diesel? added