action #130790: [alert] failed systemd alert openqa-staging-2 velociraptor-client size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

action #130790

closed

[alert] failed systemd alert openqa-staging-2 velociraptor-client size:M

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-06-13

Due date:

2023-07-01

% Done:

Estimated time:

Tags:

alert, systemd, infra, telegraf, openqa-staging-2, velociraptor

Description

Observation¶

Received grafana alert. As the machine openqa-staging-2 is not currently in production I called sudo systemctl disable --now telegraf which shortly remedied the situation but then the alert re-appeared. I checked and found telegraf back up again despite the service being masked? I triggered a reboot. Will monitor.

Acceptance criteria¶

AC1: No alert should be received for one of our staging instances

Suggestions¶

Research how a systemd service could be masked but also started again. Can we find in logs what started the service?
Crosscheck the situation again, eg. check journalctl on telegraf covering the previous period when okurz stupidly declared the ticket as resolved when apparently it wasn't
Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?
Next time wait more days if the problem reappears

Actions

Copy link

Updated by okurz almost 2 years ago

Due date set to 2023-06-27
Status changed from In Progress to Feedback

After reboot https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-6h&to=now is back to green. systemctl status on openqa-staging-2 for now is green, no failed services. telegraf is masked and not running.

Actions

Copy link

Updated by okurz almost 2 years ago

Description updated (diff)
Due date deleted (~~2023-06-27~~)
Status changed from Feedback to Resolved

I triggered a reboot and telegraf did not come up. Also after a night we are good.

Actions

Copy link

Updated by okurz almost 2 years ago

Status changed from Resolved to In Progress

Actions

Copy link

Updated by okurz almost 2 years ago

Subject changed from [alert] failed systemd alert openqa-staging-2 velociraptor-client to [alert] failed systemd alert openqa-staging-2 velociraptor-client size:M
Description updated (diff)

Actions

Copy link

Updated by mkittler almost 2 years ago

Assignee changed from okurz to mkittler

I'm currently on alert duty and it the service is now running again after I've just stopped it. So I'm taking over.

Actions

Copy link

Updated by mkittler almost 2 years ago

I've just renewed the masking. Strangely there was an additional symlink under multi-user.target.wants present:

sudo systemctl unmask telegraf.service 
Removed /etc/systemd/system/telegraf.service.
Removed /etc/systemd/system/multi-user.target.wants/telegraf.service.

Actions

Copy link

Updated by mkittler almost 2 years ago

Crosscheck if maybe osd deployment or salt state gitlab CI pipelines still access the system and re-enable due to salt even though the machine is not in the currently accepted salt keys on OSD?!?

The osd deployment just invokes sudo salt -C 'G@roles:worker' cmd.run '…' and I've just tested that this definitely not includes staging-2. It also doesn't mess with telegraf at all.

The last logs of the gitlab CI pipelines don't mention staging-2 at all. Besides, the last pipeline finished 4 hours ago but the last "enablement" of the service was just about one hour ago.

EDIT: It has actually just started itself again yet another time. There was also definitely no reboot in between. The journal looks like this:

$ sudo journalctl --since '2023-06-16 14:00' -fu telegraf.service 
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopping Telegraf...
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:08:47 openqa-staging-2 telegraf[20411]: 2023-06-16T12:08:47Z I! [agent] Stopping running outputs
Jun 16 14:08:47 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:08:47 openqa-staging-2 systemd[1]: Stopped Telegraf.
Jun 16 14:18:03 openqa-staging-2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Starting Telegraf unknown
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded inputs: chrony cpu disk diskio exec (2x) http http_response kernel mem net ping (12x) processes procstat (5x) swap system
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded aggregators:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded processors:
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Loaded outputs: influxdb
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! Tags enabled: host=openqa-staging-2
Jun 16 14:18:03 openqa-staging-2 telegraf[27227]: 2023-06-16T12:18:03Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"openqa-staging-2", Flush Interval:1m0s
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopping telegraf.service...
Jun 16 14:25:20 openqa-staging-2 telegraf[27227]: 2023-06-16T12:25:20Z I! [agent] Stopping running outputs
Jun 16 14:25:20 openqa-staging-2 systemd[1]: telegraf.service: Deactivated successfully.
Jun 16 14:25:20 openqa-staging-2 systemd[1]: Stopped telegraf.service.

The stopping was initiated manually by me and the starting in the middle is rather strange.

Note that salt-minion is actively trying to enable the service:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 09:18:03,917 [salt.state       :319 ][ERROR   ][23887] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 10:18:05,335 [salt.state       :319 ][ERROR   ][30752] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 11:18:05,402 [salt.state       :319 ][ERROR   ][5388] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 12:18:04,381 [salt.state       :319 ][ERROR   ][12705] Failed when setting service telegraf to start at boot, but the service was already running. Additional information follows:

Failed to enable unit: Unit file /etc/systemd/system/telegraf.service is masked.
2023-06-16 13:18:04,323 [salt.state       :319 ][ERROR   ][19454] {'telegraf': True}
2023-06-16 14:18:03,565 [salt.state       :319 ][ERROR   ][26277] {'telegraf': True}

So I still suspect salt-minion is causing this.

Actions

Copy link

Updated by mkittler almost 2 years ago

Ok, this is because the salt-minion is actually connected to a local salt-master. I'll disable and stop both now. How salt managed to start the masked unit is still unclear. It was definitely not only masked but also disabled and stopped and then running again (while still being masked!).

I triggered a reboot. If the problem doesn't happen again until Monday I will resolve the ticket. EDIT: It looks still good after the reboot.

Actions

Copy link

Updated by openqa_review almost 2 years ago

Due date set to 2023-07-01

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#10

Updated by mkittler almost 2 years ago

Status changed from In Progress to Resolved

No further alerts (about this) so I'm resolving.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #130790

[alert] failed systemd alert openqa-staging-2 velociraptor-client size:M

Observation¶

Acceptance criteria¶

Suggestions¶

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by okurz almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by mkittler almost 2 years ago

Updated by openqa_review almost 2 years ago

Updated by mkittler almost 2 years ago