action #130790
closed: [alert] failed systemd alert openqa-staging-2 velociraptor-client size:M
Description
Observation¶
Received a Grafana alert. As the machine openqa-staging-2 is not currently in production, I called sudo systemctl disable --now telegraf,
which briefly remedied the situation, but then the alert re-appeared. I checked and found telegraf up and running again despite the service being masked. I triggered a reboot. Will monitor.
Acceptance criteria¶
- AC1: No alert is received for any of our staging instances
Suggestions¶
- Research how a systemd service could be masked but also started again. Can we find in logs what started the service?
- Crosscheck the situation again, e.g. check journalctl for telegraf covering the previous period when okurz stupidly declared the ticket as resolved when apparently it wasn't
- Crosscheck whether the osd deployment or the salt state gitlab CI pipelines still access the system and re-enable the service via salt, even though the machine is not among the currently accepted salt keys on OSD
- Next time wait a few more days to see if the problem reappears
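One way to investigate who keeps starting the unit (a sketch, not something from the ticket itself): watch StartUnit requests on the system bus, or raise systemd's log level so PID 1 logs who enqueues start jobs. The journal excerpt in the runnable part below is fabricated for illustration; only the command names are real.

```shell
# On the affected host one could watch D-Bus traffic to systemd to catch
# the caller of StartUnit (needs root; Ctrl-C to stop):
#   sudo busctl monitor org.freedesktop.systemd1
#
# Or temporarily raise the manager's log level and follow the journal:
#   sudo systemctl log-level debug
#   sudo journalctl -f -u telegraf.service _PID=1
#
# The runnable part below only demonstrates filtering start events out of
# a journal excerpt; the excerpt is a made-up sample, not real data.
journal_excerpt='Jun 16 14:08:47 host systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 host systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.'

starts=$(printf '%s\n' "$journal_excerpt" | grep -c 'systemd\[1\]: Started')
echo "start events in excerpt: $starts"
```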
Updated by okurz over 1 year ago
- Due date set to 2023-06-27
- Status changed from In Progress to Feedback
After the reboot https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-6h&to=now is back to green. systemctl status on openqa-staging-2 is green for now, no failed services. telegraf is masked and not running.
Updated by okurz over 1 year ago
- Description updated (diff)
- Due date deleted (2023-06-27)
- Status changed from Feedback to Resolved
I triggered a reboot and telegraf did not come up. Also after a night we are good.
Updated by okurz over 1 year ago
- Subject changed from [alert] failed systemd alert openqa-staging-2 velociraptor-client to [alert] failed systemd alert openqa-staging-2 velociraptor-client size:M
- Description updated (diff)
Updated by mkittler over 1 year ago
- Assignee changed from okurz to mkittler
I'm currently on alert duty and the service is now running again after I had just stopped it. So I'm taking over.
Updated by mkittler over 1 year ago
I've just renewed the masking. Strangely, an additional symlink under multi-user.target.wants was present:
sudo systemctl unmask telegraf.service
Removed /etc/systemd/system/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
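For context, masking a unit is just a symlink to /dev/null in /etc/systemd/system, while enabling creates a wants symlink; the extra link above suggests something created the wants link independently of the mask. A minimal sketch of the filesystem mechanics, using a scratch directory instead of the real unit directory:

```shell
# Scratch stand-in for /etc/systemd/system; paths are illustrative only.
etc=$(mktemp -d)
mkdir -p "$etc/multi-user.target.wants"

# "systemctl mask telegraf" effectively does this:
ln -s /dev/null "$etc/telegraf.service"

# An enablement symlink can coexist with the mask if it was created
# separately (e.g. before the mask, or directly on the filesystem):
ln -s /usr/lib/systemd/system/telegraf.service \
    "$etc/multi-user.target.wants/telegraf.service"

mask_target=$(readlink "$etc/telegraf.service")
wants_link=$(ls "$etc/multi-user.target.wants")
echo "mask -> $mask_target, wants: $wants_link"

# "systemctl unmask" then removed both links, matching the output above.
rm "$etc/telegraf.service" "$etc/multi-user.target.wants/telegraf.service"
rm -r "$etc"
```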
Updated by mkittler over 1 year ago
> Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD
The osd deployment just invokes sudo salt -C 'G@roles:worker' cmd.run '…'
and I've just tested that this definitely does not include staging-2. It also doesn't touch telegraf at all.
The last logs of the gitlab CI pipelines don't mention staging-2 at all. Besides, the last pipeline finished 4 hours ago, but the last "enablement" of the service happened only about one hour ago.
EDIT: It has actually just started itself again yet another time. There was also definitely no reboot in between. The journal looks like this:
$ sudo journalctl --since '2023-06-16 14:00' -fu telegraf.service
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopping Telegraf...
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Stopping running outputs
Jun 16 14:08:47 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 openqa-staging-2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Starting Telegraf unknown
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded inputs: chrony cpu disk diskio exec (2x) http http_response kernel mem net ping (12x) processes procstat (5x) swap system
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded aggregators:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded processors:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded outputs: influxdb
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Tags enabled: host=openqa-staging-2
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"openqa-staging-2", Flush Interval:1m0s
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopping telegraf.service...
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Stopping running outputs
Jun 16 14:25:20 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopped telegraf.service.
The stopping was initiated manually by me and the starting in the middle is rather strange.
Note that salt-minion is actively trying to enable the service:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 09:18:03,917 [salt.state :319 ][ERROR ][23887] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 10:18:05,335 [salt.state :319 ][ERROR ][30752] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 11:18:05,402 [salt.state :319 ][ERROR ][5388] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 12:18:04,381 [salt.state :319 ][ERROR ][12705] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 13:18:04,323 [salt.state :319 ][ERROR ][19454] {'telegraf': True}
2023-06-16 14:18:03,565 [salt.state :319 ][ERROR ][26277] {'telegraf': True}
So I still suspect salt-minion is causing this.
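The hourly cadence matches a scheduled highstate on the minion. For illustration, a state like the following (a hypothetical fragment, not necessarily the actual one from our salt repo) would retry enabling and starting telegraf on every run, producing exactly these hourly errors against a masked unit:

```yaml
# Hypothetical salt state fragment for illustration only.
telegraf:
  service.running:
    - enable: True   # each highstate run retries the enable and logs the failure
```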
Updated by mkittler over 1 year ago
Ok, this is because the salt-minion is actually connected to a local salt-master. I'll disable and stop both now. How salt managed to start the masked unit is still unclear. It was definitely not only masked but also disabled and stopped and then running again (while still being masked!).
I triggered a reboot. If the problem doesn't happen again until Monday I will resolve the ticket. EDIT: It still looks good after the reboot.
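To crosscheck where a minion points, one can grep the master setting from the minion config. A sketch, assuming standard salt config locations; the grep output in the runnable part is fabricated for illustration:

```shell
# On the real host one would run:
#   sudo grep -r '^ *master:' /etc/salt/minion /etc/salt/minion.d/ 2>/dev/null
#   sudo systemctl disable --now salt-minion salt-master
#
# Fabricated sample of what such a grep might return on openqa-staging-2:
sample='/etc/salt/minion.d/master.conf:master: localhost'
master=$(printf '%s\n' "$sample" | awk -F': ' '{print $NF}')
echo "minion talks to: $master"
```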
Updated by openqa_review over 1 year ago
- Due date set to 2023-07-01
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
- Status changed from In Progress to Resolved
No further alerts (about this) so I'm resolving.