Project

General

Profile

Actions

action #130790

closed

[alert] failed systemd alert openqa-staging-2 velociraptor-client size:M

Added by okurz 11 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-06-13
Due date:
2023-07-01
% Done:

0%

Estimated time:

Description

Observation

Received grafana alert. As the machine openqa-staging-2 is not currently in production I called sudo systemctl disable --now telegraf which shortly remedied the situation but then the alert re-appeared. I checked and found telegraf back up again despite the service being masked? I triggered a reboot. Will monitor.

Acceptance criteria

  • AC1: No alert should be received for one of our staging instances

Suggestions

  • Research how a systemd service could be masked but also started again. Can we find in logs what started the service?
  • Crosscheck the situation again, eg. check journalctl on telegraf covering the previous period when okurz stupidly declared the ticket as resolved when apparently it wasn't
  • Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?
  • Next time wait more days if the problem reappears
Actions #1

Updated by okurz 11 months ago

  • Due date set to 2023-06-27
  • Status changed from In Progress to Feedback

After reboot https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-6h&to=now is back to green. systemctl status on openqa-staging-2 for now is green, no failed services. telegraf is masked and not running.

Actions #2

Updated by okurz 11 months ago

  • Description updated (diff)
  • Due date deleted (2023-06-27)
  • Status changed from Feedback to Resolved

I triggered a reboot and telegraf did not come up. Also after a night we are good.

Actions #3

Updated by okurz 11 months ago

  • Status changed from Resolved to In Progress
Actions #4

Updated by okurz 11 months ago

  • Subject changed from [alert] failed systemd alert openqa-staging-2 velociraptor-client to [alert] failed systemd alert openqa-staging-2 velociraptor-client size:M
  • Description updated (diff)
Actions #5

Updated by mkittler 11 months ago

  • Assignee changed from okurz to mkittler

I'm currently on alert duty and it the service is now running again after I've just stopped it. So I'm taking over.

Actions #6

Updated by mkittler 11 months ago

I've just renewed the masking. Strangely there was an additional symlink under multi-user.target.wants present:

sudo systemctl unmask telegraf.service 
Removed /etc/systemd/system/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.
Actions #7

Updated by mkittler 11 months ago

Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?

The osd deployment just invokes sudo salt -C 'G@roles:worker' cmd.run '…' and I've just tested that this definitely not includes staging-2. It also doesn't mess with telegraf at all.

The last logs of the gitlab CI pipelines don't mention staging-2 at all. Besides, the last pipeline finished 4 hours ago but the last "enablement" of the service was just about one hour ago.

EDIT: It has actually just started itself again yet another time. There was also definitely no reboot in between. The journal looks like this:

$ sudo journalctl --since '2023-06-16 14:00' -fu telegraf.service 
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopping Telegraf...
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Stopping running outputs
Jun 16 14:08:47 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 openqa-staging-2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Starting Telegraf unknown
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded inputs: chrony cpu disk diskio exec (2x) http http_response kernel mem net ping (12x) processes procstat (5x) swap system
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded aggregators:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded processors:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded outputs: influxdb
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Tags enabled: host=openqa-staging-2
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"openqa-staging-2", Flush Interval:1m0s
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopping telegraf.service...
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Stopping running outputs
Jun 16 14:25:20 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopped telegraf.service.

The stopping was initiated manually by me and the starting in the middle is rather strange.

Note that salt-minion is actively trying to enable the service:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 09:18:03,917 [salt.state       :319 ][ERROR   ][23887] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 10:18:05,335 [salt.state       :319 ][ERROR   ][30752] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 11:18:05,402 [salt.state       :319 ][ERROR   ][5388] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 12:18:04,381 [salt.state       :319 ][ERROR   ][12705] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 13:18:04,323 [salt.state       :319 ][ERROR   ][19454] {'telegraf': True}
2023-06-16 14:18:03,565 [salt.state       :319 ][ERROR   ][26277] {'telegraf': True}

So I still suspect salt-minion is causing this.

Actions #8

Updated by mkittler 11 months ago

Ok, this is because the salt-minion is actually connected to a local salt-master. I'll disable and stop both now. How salt managed to start the masked unit is still unclear. It was definitely not only masked but also disabled and stopped and then running again (while still being masked!).

I triggered a reboot. If the problem doesn't happen again until Monday I will resolve the ticket. EDIT: It looks still good after the reboot.

Actions #9

Updated by openqa_review 11 months ago

  • Due date set to 2023-07-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by mkittler 11 months ago

  • Status changed from In Progress to Resolved

No further alerts (about this) so I'm resolving.

Actions

Also available in: Atom PDF