action #167257

closed

Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again size:S

Added by livdywan 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date: 2024-10-25
% Done: 0%
Estimated time:

Description

Observation

Trying to open Grafana to check alerts, I found it unavailable, showing only a white page with an error code:

502 Bad Gateway
nginx/1.21.5

Acceptance criteria

  • AC1: During expected deployments of grafana a proper user-facing status is shown instead of "bad gateway"
  • AC2: We still ensure that grafana related config updates are applied
  • AC3: We are alerted if grafana refuses to start up at all (e.g. failing systemd service triggering alert)

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z (Resolved, okurz, 2024-09-29)

Copied from openQA Infrastructure (public) - action #166979: Grafana aka monitor.qa.suse.de reporting Bad Gateway error (Resolved, okurz, 2024-09-18)

Actions #1

Updated by livdywan 3 months ago

  • Copied from action #166979: Grafana aka monitor.qa.suse.de reporting Bad Gateway error added
Actions #2

Updated by nicksinger 3 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 3 months ago

  • Priority changed from High to Normal

Currently grafana is available again. Since I recently merged a commit and the deployment ran (and was fixed), I assume this correlates with the restart of grafana on the monitor host. I remember we had a similar discussion in the past and deemed it unfeasible to hot-reload grafana. I will check if something changed or if I can find an easy workaround.

Actions #4

Updated by okurz 3 months ago

  • Subject changed from Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again to Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again size:S
  • Description updated (diff)
Actions #5

Updated by nicksinger 3 months ago

I found https://grafana.com/docs/grafana/latest/developers/http_api/admin/#reload-provisioning-configurations which could help in reloading grafana during deployments. I tried it on monitor with curl -X POST --unix-socket /var/run/grafana/grafana.socket http:/api/admin/settings but only got a 404 - I think this is related to not being authenticated. Since the upstream docs state multiple times that a "local user" is needed, I went ahead and tried to use "admin" (which should also have enough permissions to do all this). Unfortunately the password is lost, so I created https://gitlab.suse.de/openqa/password/-/merge_requests/18
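
A sketch of what the authenticated calls could look like once an admin password is in place (endpoint paths taken from the upstream docs, $GRAFANA_ADMIN_PW is just a placeholder; untested):

# reset the lost admin password locally; grafana-cli ships with the grafana package
grafana-cli admin reset-admin-password "$GRAFANA_ADMIN_PW"
# reload dashboard and datasource provisioning through the unix socket
curl -X POST -u "admin:$GRAFANA_ADMIN_PW" --unix-socket /var/run/grafana/grafana.socket http://localhost/api/admin/provisioning/dashboards/reload
curl -X POST -u "admin:$GRAFANA_ADMIN_PW" --unix-socket /var/run/grafana/grafana.socket http://localhost/api/admin/provisioning/datasources/reload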

Actions #6

Updated by nicksinger 3 months ago

next steps:

  • reset the admin password with the documented one
  • try if the api calls work with that user
  • implement a small script or something to hook into systemd to reload the service
  • add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
  • add a simple error_page 502-directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed
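
A minimal sketch of what the error_page part could look like (the page path and wording are assumptions, not a final config):

# in the nginx server block in front of grafana on monitor: serve a static page instead of the bare 502
error_page 502 /grafana_unavailable.html;
location = /grafana_unavailable.html {
    root /srv/www/htdocs;   # assumed location of the static maintenance page
    internal;
}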
Actions #7

Updated by openqa_review 3 months ago

  • Due date set to 2024-10-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-6:

next steps:

  • reset the admin password with the documented one
  • try if the api calls work with that user
  • implement a small script or something to hook into systemd to reload the service
  • add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
  • add a simple error_page 502-directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed

In yesterday's daily we discussed a nightly restart and decided to only implement it if we realize that the other restarts (e.g. at reboot, after package updates) are not sufficient for us. I created a reload-script now and hooked it into our grafana service to make systemctl reload grafana-server possible. I'm not sure yet where to put the admin credentials but most likely they will end up either manually configured or in our pillars (have to cross-check if they are public). Also, nothing makes use of this reload function yet, so I have to go through our states and pipelines, find the places where we restart grafana and replace that with a reload-or-restart.
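
A minimal sketch of the drop-in that enables reload (the helper script name and path are placeholders, not the actual implementation):

# /etc/systemd/system/grafana-server.service.d/00-enable-reload.conf (sketch)
[Service]
# hypothetical helper calling the grafana admin provisioning reload endpoints via the unix socket
ExecReload=/usr/local/bin/grafana-reload-provisioning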

Actions #10

Updated by nicksinger 3 months ago

  • Status changed from Feedback to Workable

MR merged. But I still need to add the credentials for the admin user.

Actions #11

Updated by okurz 3 months ago

  • Related to action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z added
Actions #13

Updated by nicksinger 3 months ago

I had to follow up fixing two mistakes:

The service is now capable of handling a "reload" command. I guess we will see in the future if our changes get properly applied by Grafana's provisioning system. This should cover AC2. AC1 is covered implicitly by not showing the user an error at all during expected deployments (deployments now use "reload": https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1272/diffs#4878b8e726c5a6267ccc8ac5890a0280aad40e9d_219_229).
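
For illustration, in a Salt state this kind of change typically boils down to something like the following (a sketch with assumed file paths, not the actual diff from the MR):

grafana-server:
  service.running:
    - enable: True
    - reload: True    # reload instead of a full restart when watched files change
    - watch:
      - file: /etc/grafana/grafana.ini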

AC3 should be covered by: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1282
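
Judging by the drop-in 01-service-fail-mail.conf and the "Triggering OnFailure= dependencies" journal line further down, the alerting presumably hooks in roughly like this (a sketch, the notify unit name is an assumption):

# /etc/systemd/system/grafana-server.service.d/01-service-fail-mail.conf (sketch)
[Unit]
# start a mail-sending unit whenever grafana-server enters a failed state
OnFailure=status-mail@%n.service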

Actions #14

Updated by nicksinger 3 months ago

  • Status changed from Feedback to Resolved

All merged and deployed.

Actions #15

Updated by okurz 2 months ago

  • Due date changed from 2024-10-09 to 2024-10-25
  • Status changed from Resolved to Workable

13.10.2024 03:37:46 root root@monitor.qa.suse.de:

grafana-server.service on host monitor.qe.nue2.suse.org failed to start
please ssh into the host and check systemctl status grafana-server.service for potential reasons

As I have received more alert messages afterwards, I assume this was transient and we should look into this message to avoid it in the future.

Actions #16

Updated by nicksinger 2 months ago

  • Status changed from Workable to Feedback
Actions #17

Updated by okurz 2 months ago

merged and deployed. I triggered a reboot of monitor.

Actions #18

Updated by nicksinger 2 months ago

  • Status changed from Feedback to Resolved
root@monitor:~ # uptime
 18:49:56  up   2:34,  1 user,  load average: 1.13, 1.25, 1.34
root@monitor:~ # systemctl status grafana-server
● grafana-server.service - Grafana instance
     Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/grafana-server.service.d
             └─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
     Active: active (running) since Tue 2024-10-15 16:16:57 CEST; 2h 33min ago

reboot was conducted and the service came up without alerting us by mail. I purposely broke /etc/grafana/grafana.ini and restarted the service which failed and logged:

root@monitor:~ # systemctl restart grafana-server
Job for grafana-server.service failed because the control process exited with error code.
See "systemctl status grafana-server.service" and "journalctl -xeu grafana-server.service" for details.
root@monitor:~ # systemctl status grafana-server
× grafana-server.service - Grafana instance
     Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/grafana-server.service.d
             └─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
     Active: failed (Result: exit-code) since Tue 2024-10-15 18:52:58 CEST; 15s ago
   Duration: 2h 35min 51.622s
       Docs: http://docs.grafana.org
    Process: 15139 ExecStart=/usr/share/grafana/bin/grafana server --config=${CONF_FILE} --pidfile=${PID_FILE_DIR}/grafana-server.pid --packaging=rpm cfg:default.paths.logs=${LOG_DIR} cfg:default.paths.data=${DATA_DIR} cfg:default.paths.plugins=${PLUGINS_DIR} cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR} (>
   Main PID: 15139 (code=exited, status=1/FAILURE)
        CPU: 506ms

Oct 15 18:52:56 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 4.
Oct 15 18:52:56 monitor systemd[1]: Starting Grafana instance...
Oct 15 18:52:57 monitor grafana[15139]: logger=settings t=2024-10-15T18:52:57.896695559+02:00 level=error msg="failed to parse \"/etc/grafana/grafana.ini\": key-value delimiter not found: NSINGER BROKE THIS _ DELETE ME IF YOU FIND ME AND RESTART GRAFANA\n"
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Start request repeated too quickly.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: Failed to start Grafana instance.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Triggering OnFailure= dependencies.

And I received an e-mail. So it alerts on "real" issues but keeps silent on reboots.
