action #130790
closed[alert] failed systemd alert openqa-staging-2 velociraptor-client size:M
Description
Observation
Received a grafana alert. As the machine openqa-staging-2 is not currently in production I called sudo systemctl disable --now telegraf,
which briefly remedied the situation, but then the alert re-appeared. I checked and found telegraf up and running again despite the service being masked. I triggered a reboot. Will monitor.
Acceptance criteria
- AC1: No alert should be received for one of our staging instances
Suggestions
- Research how a systemd service could be masked but also started again. Can we find in logs what started the service?
- Crosscheck the situation again, e.g. check journalctl for telegraf covering the previous period when okurz prematurely declared the ticket resolved when apparently it wasn't
- Crosscheck whether the osd deployment or the salt state gitlab CI pipelines still access the system and re-enable the service via salt, even though the machine is not among the currently accepted salt keys on OSD
- Next time, wait more days before resolving in case the problem reappears
Updated by okurz 11 months ago
- Due date set to 2023-06-27
- Status changed from In Progress to Feedback
After reboot https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-6h&to=now is back to green. systemctl status
on openqa-staging-2 for now is green, no failed services. telegraf is masked and not running.
Updated by mkittler 11 months ago
I've just renewed the masking. Strangely, an additional symlink was present under multi-user.target.wants:
$ sudo systemctl unmask telegraf.service
Removed /etc/systemd/system/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
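For context on that output (a sketch, not taken from the machine): the mask link and the enablement link are two independent symlinks, which is why unmask reported removing two files. The scratch directory below stands in for /etc/systemd/system, and the unit file path is only illustrative:

```shell
# Recreate the on-disk state of a unit that is both masked and enabled,
# in a throwaway directory standing in for /etc/systemd/system.
etc=$(mktemp -d)
mkdir "$etc/multi-user.target.wants"
# what `systemctl mask telegraf` creates:
ln -s /dev/null "$etc/telegraf.service"
# what `systemctl enable telegraf` (e.g. run earlier by salt) creates:
ln -s /usr/lib/systemd/system/telegraf.service \
      "$etc/multi-user.target.wants/telegraf.service"
readlink "$etc/telegraf.service"                          # prints: /dev/null
readlink "$etc/multi-user.target.wants/telegraf.service"  # prints the unit path
rm -r "$etc"
```

Both links can coexist because masking only shadows the unit file; it does not touch the target.wants enablement link, so a single unmask call ends up removing two files.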
Updated by mkittler 11 months ago
Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?
The osd deployment just invokes sudo salt -C 'G@roles:worker' cmd.run '…'
and I've just verified that this definitely does not include staging-2. It also doesn't touch telegraf at all.
The latest logs of the gitlab CI pipelines don't mention staging-2 at all. Besides, the last pipeline finished 4 hours ago, whereas the last "enablement" of the service was only about one hour ago.
EDIT: It has actually just started itself yet again. There was definitely no reboot in between. The journal looks like this:
$ sudo journalctl --since '2023-06-16 14:00' -fu telegraf.service
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopping Telegraf...
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Stopping running outputs
Jun 16 14:08:47 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 openqa-staging-2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Starting Telegraf unknown
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded inputs: chrony cpu disk diskio exec (2x) http http_response kernel mem net ping (12x) processes procstat (5x) swap system
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded aggregators:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded processors:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded outputs: influxdb
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Tags enabled: host=openqa-staging-2
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"openqa-staging-2", Flush Interval:1m0s
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopping telegraf.service...
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Stopping running outputs
Jun 16 14:25:20 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopped telegraf.service.
I initiated both stops manually; the start in the middle is what's strange.
Note that salt-minion is actively trying to enable the service:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 09:18:03,917 [salt.state :319 ][ERROR ][23887] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 10:18:05,335 [salt.state :319 ][ERROR ][30752] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 11:18:05,402 [salt.state :319 ][ERROR ][5388] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 12:18:04,381 [salt.state :319 ][ERROR ][12705] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 13:18:04,323 [salt.state :319 ][ERROR ][19454] {'telegraf': True}
2023-06-16 14:18:03,565 [salt.state :319 ][ERROR ][26277] {'telegraf': True}
So I still suspect salt-minion is causing this.
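The salt.state errors above recur at roughly one-hour intervals (always around :18), which fits a periodic scheduled run rather than anything reboot-triggered. A quick sanity check on the gaps, using the timestamps copied from the log above:

```shell
# Compute the minute gaps between the salt.state error timestamps quoted
# above (times hard-coded from the ticket, not read from a live journal).
printf '%s\n' 09:18:03 10:18:05 11:18:05 12:18:04 13:18:04 14:18:03 |
awk -F: 'NR>1 { printf "%d min\n", ($1*60+$2) - (p1*60+p2) } { p1=$1; p2=$2 }'
# prints "60 min" five times
```

A strictly hourly cadence like this is consistent with a scheduled salt highstate on the minion, which supports the suspicion that salt-minion is the culprit.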
Updated by mkittler 11 months ago
OK, this is because the salt-minion is actually connected to a local salt-master. I'll disable and stop both now. How salt managed to start the masked unit is still unclear: it was definitely not only masked but also disabled and stopped, and then running again (while still being masked!).
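One hypothetical mechanism for "masked yet running", not confirmed anywhere in this ticket: if a configuration-management run force-writes the unit file, it replaces the mask symlink with a regular file, after which systemd can start the unit normally. The filesystem side of that scenario, sketched in a scratch directory:

```shell
# Hypothetical scenario only (assumed, not observed in this ticket):
# a tool that rewrites the unit file clobbers the mask symlink.
etc=$(mktemp -d)                          # stands in for /etc/systemd/system
ln -s /dev/null "$etc/telegraf.service"   # the unit is masked
rm -f "$etc/telegraf.service"             # the tool replaces the file...
printf '[Service]\nExecStart=/usr/bin/telegraf\n' > "$etc/telegraf.service"
test -L "$etc/telegraf.service" || echo "mask gone, unit startable again"
rm -r "$etc"
```

Note that this would leave no mask symlink behind afterwards, so it only partially matches the observation here, where the unit was reportedly still masked while running.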
I triggered a reboot. If the problem doesn't reappear by Monday I will resolve the ticket. EDIT: It still looks good after the reboot.
Updated by openqa_review 11 months ago
- Due date set to 2023-07-01
Setting due date based on mean cycle time of SUSE QE Tools