action #176175
opencoordination #161414: [epic] Improved salt based infrastructure management
[alert] Grafana failed to start due to corrupted config file
0%
Description
Observation
From
- Date: Sun, 26 Jan 2025 02:36:49 +0000 (UTC) to
- Date: Sun, 26 Jan 2025 08:42:55 +0000 (UTC)
we got 10 emails
grafana-server.service on host monitor.qe.nue2.suse.org failed to start
please ssh into the host and check systemctl status grafana-server.service for potential reasons
% journalctl -u grafana-server.service --since "2025-01-26 02:30:00"
...
Jan 26 03:39:26 monitor systemd[1]: Starting Grafana instance...
Jan 26 03:39:26 monitor grafana[22024]: logger=settings t=2025-01-26T03:39:26.533904269Z level=error msg="failed to parse \"/etc/grafana/grafana.ini\": key-value delimiter not found: \u0000\u0000\u0000\u0000\u>
Jan 26 03:39:26 monitor grafana[22024]: u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u>
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Start request repeated too quickly.
Jan 26 03:39:26 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Jan 26 03:39:26 monitor systemd[1]: Failed to start Grafana instance.
So it seems that /etc/grafana/grafana.ini was broken.
According to the log, the server started again successfully at 09:01:16
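As a quick way to confirm this kind of corruption, the sketch below counts the NUL bytes in a file by comparing its size before and after stripping them. The function name `count_nul_bytes` is hypothetical and not part of any existing tooling; it only assumes POSIX `tr` and `wc`.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): print the number of NUL bytes in a file.
count_nul_bytes() {
    file=$1
    # Total size of the file in bytes
    total=$(wc -c < "$file")
    # Size after deleting every NUL byte
    stripped=$(tr -d '\000' < "$file" | wc -c)
    # The difference is the number of NUL bytes
    echo $((total - stripped))
}
```

For example, `count_nul_bytes /etc/grafana/grafana.ini` should print 0 for a healthy file and a positive number for a file damaged like the one in the log above.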
Updated by tinita about 1 month ago
- Related to action #175710: OSD openqa.ini is corrupted, invalid characters, again 2025-01-17 added
Updated by ybonatakis about 1 month ago
I took a look at the alert on Sunday morning. Here is what I did to resolve the problem:
sudo cp grafana.ini grafana.ini.invalidchars
tr -d '\0' < grafana.ini > grafana_clean.ini
diff grafana.ini grafana_clean.ini
cp grafana_clean.ini grafana.ini
systemctl restart grafana-server.service
From the diff I can say that a comment line was replaced by a sequence of "@@@@@@*". So the mitigation did not change any effective configuration.
I also checked whether there was a commit or anything else in salt related to the file, but I don't think there was any. I assume that it is an isolated accident.
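The manual steps above could be wrapped roughly as follows. This is only a sketch under assumptions: the function name `strip_nul_bytes` is hypothetical, the `.invalidchars` backup suffix is taken from the commands above, and the function must run as a user allowed to write next to the file.

```shell
#!/bin/sh
# Hypothetical helper (not existing tooling): remove NUL bytes from a config
# file in place, keeping the corrupted original as "<file>.invalidchars".
strip_nul_bytes() {
    file=$1
    backup="$file.invalidchars"
    tmp=$(mktemp "$file.XXXXXX") || return 1
    # Write a cleaned copy without NUL bytes
    tr -d '\000' < "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Nothing to do if the cleaned copy is identical to the original
    if cmp -s "$file" "$tmp"; then
        rm -f "$tmp"
        return 0
    fi
    # Keep the corrupted version for later analysis
    cp -p "$file" "$backup" || { rm -f "$tmp"; return 1; }
    # Overwrite the content via cat so owner and mode of the original file stay as they are
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

A possible invocation would be `strip_nul_bytes /etc/grafana/grafana.ini && systemctl restart grafana-server.service`. Comparing the file against its `.invalidchars` backup with `diff` afterwards shows what was removed.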
Updated by okurz about 1 month ago
- Tags set to infra, reactive work, alert
- Subject changed from [alert] Grafana failed to start to [alert] Grafana failed to start due to corrupted config file
- Assignee set to okurz
- Parent task set to #161414
@ybonatakis thx for the recovery work
"I assume that it is an isolated accident"
By now we have had similar events often enough that I don't think this is an isolated incident. I plan to create a separate ticket about preventing the corruption. Here we can focus solely on the alert handling and recovery.
Updated by okurz about 1 month ago
- Blocked by action #176250: file corruption in salt controlled config files size:M added
Updated by okurz about 1 month ago
- Status changed from New to Blocked
- Priority changed from High to Normal
Right now Grafana is fine, blocking on #176250.