action #167257
Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again size:S (closed)
Description
Observation
Trying to open Grafana to check alerts I found it's not available and showing a white page with an error code:
502 Bad Gateway
nginx/1.21.5
Acceptance criteria
- AC1: During expected deployments of grafana a proper user-facing status is shown instead of "bad gateway"
- AC2: We still ensure that grafana related config updates are applied
- AC3: We are alerted if grafana refuses to start up at all (e.g. failing systemd service triggering alert)
Suggestions
- Check ssh access and systemctl status grafana-server. As needed restart grafana
- Look into custom bad gateway pages for nginx (see the sketch after this list), e.g. https://stackoverflow.com/questions/7796237/custom-bad-gateway-page-with-nginx or https://serverfault.com/questions/185637/custom-page-on-502-bad-gateway-error/194301#194301
- Consider notifying nginx about pending grafana restarts, e.g. preexec call in custom systemd service override
- Can we just trigger a reload of grafana instead of restarting?
- Inspect pipelines for alert conflicts, see also #166979
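A minimal sketch of what such a custom 502 page could look like in the nginx config on monitor; the document root and file name are assumptions, not the actual setup:

error_page 502 /502.html;
location = /502.html {
    root /srv/www/htdocs;   # assumption: directory holding the static maintenance page
    internal;               # only used for internal error handling, not served on direct requests
}

With something like this nginx would show a static "grafana is currently restarting or being deployed" page instead of the bare 502.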
Updated by livdywan 3 months ago
- Copied from action #166979: Grafana aka monitor.qa.suse.de reporting Bad Gateway error added
Updated by nicksinger 3 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 3 months ago
- Priority changed from High to Normal
Currently grafana is available again. Since I just recently merged a commit and the deployment ran/was fixed, I assume this might correlate with the restart of grafana on the monitor host. I remember we had a similar discussion in the past and deemed it unfeasible to hot-reload grafana. I will check if something changed or if I can find an easy workaround.
Updated by nicksinger 3 months ago
I found https://grafana.com/docs/grafana/latest/developers/http_api/admin/#reload-provisioning-configurations which could help in reloading grafana during deployments. I tried it on monitor with:
curl -X POST --unix-socket /var/run/grafana/grafana.socket http:/api/admin/settings
but only got a 404 - I think this is related to not being authenticated. Since the upstream docs state multiple times that a "local user" is needed, I went ahead and tried to use "admin" (which should also have enough permissions to do all this). Unfortunately the password is lost, so I created https://gitlab.suse.de/openqa/password/-/merge_requests/18
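For reference, a hedged example of how an authenticated reload call could look once the admin password is available; the endpoint path is taken from the linked reload-provisioning docs, ADMIN_PASSWORD is a placeholder, and it is an assumption that basic auth works like this over the unix socket:

curl -s -X POST --unix-socket /var/run/grafana/grafana.socket \
  -u admin:ADMIN_PASSWORD \
  http://localhost/api/admin/provisioning/dashboards/reload

The docs list similar reload endpoints for the other provisioning types (e.g. datasources and plugins).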
Updated by nicksinger 3 months ago
next steps:
- reset the admin password with the documented one
- try if the api calls work with that user
- implement a small script or something to hook into systemd to reload the service (see the sketch after this list)
- add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
- add a simple error_page 502 directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed
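A minimal sketch of the password reset and the systemd reload hook; the drop-in file name and the script path are illustrative assumptions, not the final implementation:

# reset the lost admin password to the documented one (grafana-cli is part of the grafana package)
grafana-cli admin reset-admin-password "$DOCUMENTED_ADMIN_PASSWORD"

# /etc/systemd/system/grafana-server.service.d/00-enable-reload.conf
[Service]
ExecReload=/usr/local/bin/grafana-reload.sh   # hypothetical script calling the provisioning reload API

With an ExecReload= in place, systemctl reload grafana-server (and salt's reload-or-restart) becomes usable without a full restart.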
Updated by openqa_review 3 months ago
- Due date set to 2024-10-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 3 months ago
- Status changed from In Progress to Feedback
nicksinger wrote in #note-6:
next steps:
- reset the admin password with the documented one
- try if the api calls work with that user
- implement a small script or something to hook into systemd to reload the service
- add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
- add a simple error_page 502 directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed
In yesterday's daily we discussed the nightly restart and decided to only implement it if we realize the other restarts (e.g. at reboot, after package updates) are not sufficient for us. I created a reload-script now and hooked it into our grafana service to make systemctl reload grafana-server possible. I'm not sure yet where to put the admin credentials, but most likely they will end up either manually configured or in our pillars (I have to cross-check whether those are public). Also, nothing makes use of this reload function yet, so I have to go through our states and pipelines, find the places where we restart grafana and replace that with a reload-or-restart; a sketch of what that could look like in the salt states follows below.
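A sketch of switching from restart to reload semantics, assuming grafana is managed through a service.running state in salt-states-openqa; the state and the watched file are illustrative:

grafana-server:
  service.running:
    - enable: True
    - reload: True        # on a watch trigger salt reloads instead of restarting the service
    - watch:
      - file: /etc/grafana/grafana.ini

Alternatively the pipelines could call systemctl reload-or-restart grafana-server directly where they currently call restart.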
Updated by nicksinger 3 months ago
Updated by nicksinger 3 months ago
- Status changed from Feedback to Workable
MR merged, but I still need to add the credentials for the admin user.
Updated by okurz 3 months ago
- Related to action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z added
Updated by nicksinger 3 months ago
- Status changed from Workable to Feedback
Updated by nicksinger 3 months ago
I had to follow up and fix two mistakes:
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1280
- https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1281
The service is now capable of handling a "reload" command. I guess we will see in the future if our changes get properly applied by grafana's provisioning system. This should cover AC2. AC1 is covered implicitly by not showing the user an error at all during expected deployments (deployments now use "reload": https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1272/diffs#4878b8e726c5a6267ccc8ac5890a0280aad40e9d_219_229).
AC3 should be covered by: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1282
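For illustration, a failing grafana-server can trigger an alert mail via a systemd OnFailure= drop-in along these lines; this is a sketch and not necessarily what the linked MR implements, and the notify unit name is hypothetical (the drop-in file name matches what later shows up in the service status):

# /etc/systemd/system/grafana-server.service.d/01-service-fail-mail.conf
[Unit]
OnFailure=notify-email@%n.service   # hypothetical templated unit that mails the failure details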
Updated by nicksinger 3 months ago
- Status changed from Feedback to Resolved
All merged and deployed.
Updated by okurz 2 months ago
- Due date changed from 2024-10-09 to 2024-10-25
- Status changed from Resolved to Workable
13.10.2024 03:37:46 root root@monitor.qa.suse.de:
grafana-server.service on host monitor.qe.nue2.suse.org failed to start
please ssh into the host and check systemctl status grafana-server.service for potential reasons
As I have received more alert messages afterwards, I assume this was transient and we should look into this message to avoid it in the future.
Updated by nicksinger 2 months ago
- Status changed from Workable to Feedback
Updated by nicksinger 2 months ago
- Status changed from Feedback to Resolved
root@monitor:~ # uptime
18:49:56 up 2:34, 1 user, load average: 1.13, 1.25, 1.34
root@monitor:~ # systemctl status grafana-server
● grafana-server.service - Grafana instance
Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/grafana-server.service.d
└─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
Active: active (running) since Tue 2024-10-15 16:16:57 CEST; 2h 33min ago
A reboot was conducted and the service came up without alerting us by mail. I purposely broke /etc/grafana/grafana.ini and restarted the service, which failed and logged:
root@monitor:~ # systemctl restart grafana-server
Job for grafana-server.service failed because the control process exited with error code.
See "systemctl status grafana-server.service" and "journalctl -xeu grafana-server.service" for details.
root@monitor:~ # systemctl status grafana-server
× grafana-server.service - Grafana instance
Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/grafana-server.service.d
└─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
Active: failed (Result: exit-code) since Tue 2024-10-15 18:52:58 CEST; 15s ago
Duration: 2h 35min 51.622s
Docs: http://docs.grafana.org
Process: 15139 ExecStart=/usr/share/grafana/bin/grafana server --config=${CONF_FILE} --pidfile=${PID_FILE_DIR}/grafana-server.pid --packaging=rpm cfg:default.paths.logs=${LOG_DIR} cfg:default.paths.data=${DATA_DIR} cfg:default.paths.plugins=${PLUGINS_DIR} cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR} (>
Main PID: 15139 (code=exited, status=1/FAILURE)
CPU: 506ms
Oct 15 18:52:56 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 4.
Oct 15 18:52:56 monitor systemd[1]: Starting Grafana instance...
Oct 15 18:52:57 monitor grafana[15139]: logger=settings t=2024-10-15T18:52:57.896695559+02:00 level=error msg="failed to parse \"/etc/grafana/grafana.ini\": key-value delimiter not found: NSINGER BROKE THIS _ DELETE ME IF YOU FIND ME AND RESTART GRAFANA\n"
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Start request repeated too quickly.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: Failed to start Grafana instance.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Triggering OnFailure= dependencies.
And I received an e-mail. So it alerts on "real" issues but keeps silent on reboots.