Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?
The osd deployment just invokes sudo salt -C 'G@roles:worker' cmd.run '…'
and I've just tested that this definitely not includes staging-2. It also doesn't mess with telegraf at all.
The last logs of the gitlab CI pipelines don't mention staging-2 at all. Besides, the last pipeline finished 4 hours ago but the last "enablement" of the service was just about one hour ago.
EDIT: It has actually just started itself again yet another time. There was also definitely no reboot in between. The journal looks like this:
$ sudo journalctl --since '2023-06-16 14:00' -fu telegraf.service
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopping Telegraf...
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Stopping running outputs
Jun 16 14:08:47 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 openqa-staging-2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Starting Telegraf unknown
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded inputs: chrony cpu disk diskio exec (2x) http http_response kernel mem net ping (12x) processes procstat (5x) swap system
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded aggregators:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded processors:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded outputs: influxdb
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Tags enabled: host=openqa-staging-2
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"openqa-staging-2", Flush Interval:1m0s
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopping telegraf.service...
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Stopping running outputs
Jun 16 14:25:20 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopped telegraf.service.
The stopping was initiated manually by me and the starting in the middle is rather strange.
Note that salt-minion is actively trying to enable the service:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 09:18:03,917 [salt.state :319 ][ERROR ][23887] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 10:18:05,335 [salt.state :319 ][ERROR ][30752] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 11:18:05,402 [salt.state :319 ][ERROR ][5388] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 12:18:04,381 [salt.state :319 ][ERROR ][12705] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:
Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 13:18:04,323 [salt.state :319 ][ERROR ][19454] {'telegraf': True}
2023-06-16 14:18:03,565 [salt.state :319 ][ERROR ][26277] {'telegraf': True}
So I still suspect salt-minion is causing this.