action #176175

open

coordination #161414: [epic] Improved salt based infrastructure management

[alert] Grafana failed to start due to corrupted config file

Added by tinita about 1 month ago. Updated about 1 month ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-01-26
Due date:
% Done: 0%
Estimated time:

Description

Observation

From Sun, 26 Jan 2025 02:36:49 +0000 (UTC) to Sun, 26 Jan 2025 08:42:55 +0000 (UTC) we got 10 emails:

grafana-server.service on host monitor.qe.nue2.suse.org failed to start
please ssh into the host and check systemctl status grafana-server.service for potential reasons
% journalctl -u grafana-server.service --since "2025-01-26 02:30:00"
...
Jan 26 03:39:26 monitor systemd[1]: Starting Grafana instance...                   
Jan 26 03:39:26 monitor grafana[22024]: logger=settings t=2025-01-26T03:39:26.533904269Z level=error msg="failed to parse \"/etc/grafana/grafana.ini\": key-value delimiter not found: \u0000\u0000\u0000\u0000\u>
Jan 26 03:39:26 monitor grafana[22024]: u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u>
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Start request repeated too quickly.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Jan 26 03:39:26 monitor systemd[1]: Failed to start Grafana instance.              

So it seems that /etc/grafana/grafana.ini was corrupted: the parser error shows a long run of NUL bytes (\u0000) in the file.
According to the log, the server started again successfully at 09:01:16.
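
For reference, a minimal sketch of how one might check a config file for the NUL bytes seen in the log above. The function names `count_nul_bytes` and `config_is_clean` are hypothetical helpers made up for illustration; they are not part of the salt states or the existing monitoring.

```shell
# Hypothetical helper: print the number of NUL bytes in a file.
count_nul_bytes() {
    # tr -cd '\0' deletes everything except NUL bytes; wc -c counts what is left.
    tr -cd '\0' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: succeed (exit status 0) only if the file has no NUL bytes.
config_is_clean() {
    [ "$(count_nul_bytes "$1")" -eq 0 ]
}

# Example usage (path taken from the log above):
# config_is_clean /etc/grafana/grafana.ini || echo "grafana.ini contains NUL bytes"
```

A check like this only detects this one kind of corruption; it does not validate the ini syntax itself.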


Related issues 2 (2 open, 0 closed)

Related to openQA Infrastructure (public) - action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 (Blocked, okurz, 2024-07-10)
Blocked by openQA Infrastructure (public) - action #176250: file corruption in salt controlled config files size:M (Workable)

Actions #1

Updated by tinita about 1 month ago

  • Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Actions #2

Updated by ybonatakis about 1 month ago

I took a look at the alert on Sunday morning. Here is what I did to resolve the problem:

sudo cp grafana.ini grafana.ini.invalidchars
tr -d '\0' < grafana.ini > grafana_clean.ini
diff grafana.ini grafana_clean.ini
cp grafana_clean.ini grafana.ini
systemctl restart grafana-server.service

From the diff I can say that a comment line was replaced by a sequence of "@@@@@@*", so the mitigation did not change any configuration.
I also checked whether there was a commit or anything else in salt related to the file, but I don't think there was any. I assume that it is an isolated accident
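
The recovery steps above could be wrapped in a small helper along these lines. This is only a sketch under assumptions: `strip_nul_bytes` is a hypothetical name, and it assumes the commands are run with sufficient privileges in the directory containing the file (the log points at /etc/grafana).

```shell
# Hypothetical helper: strip NUL bytes from a file in place,
# keeping a backup copy of the corrupted original next to it.
strip_nul_bytes() {
    file=$1
    backup="$file.invalidchars"   # same backup suffix as in the steps above
    cleaned=$(mktemp) || return 1

    cp -p "$file" "$backup" || return 1
    tr -d '\0' < "$file" > "$cleaned" || return 1

    # Writing into the existing file (instead of mv) keeps its owner and mode.
    cat "$cleaned" > "$file" && rm -f "$cleaned"
}

# Example usage (assumed to be run as root from /etc/grafana):
# strip_nul_bytes grafana.ini && systemctl restart grafana-server.service
```

Keeping the backup makes it possible to inspect later which bytes were corrupted, as done with diff above.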

Actions #3

Updated by okurz about 1 month ago

  • Tags set to infra, reactive work, alert
  • Subject changed from [alert] Grafana failed to start to [alert] Grafana failed to start due to corrupted config file
  • Assignee set to okurz
  • Parent task set to #161414

@ybonatakis thx for the recovery work

I assume that it is an isolated accident

By now we have had similar events often enough that I don't think this is an isolated incident. I plan to create a separate ticket about preventing the corruption. Here we can focus solely on the alert handling and recovery.

Actions #4

Updated by okurz about 1 month ago

  • Blocked by action #176250: file corruption in salt controlled config files size:M added
Actions #5

Updated by okurz about 1 month ago

  • Status changed from New to Blocked
  • Priority changed from High to Normal

right now grafana is fine, blocking on #176250

Actions #6

Updated by okurz about 1 month ago

  • Target version changed from Ready to future