action #167257

closed

Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again size:S

Added by livdywan 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date: 2024-10-25
% Done: 0%
Estimated time:

Description

Observation

Trying to open Grafana to check alerts, I found it unavailable, showing only a white page with an error code:

502 Bad Gateway
nginx/1.21.5

Acceptance criteria

  • AC1: During expected deployments of grafana a proper user-facing status is shown instead of "bad gateway"
  • AC2: We still ensure that grafana related config updates are applied
  • AC3: We are alerted if grafana refuses to start up at all (e.g. failing systemd service triggering alert)

Suggestions


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z (Resolved, okurz, 2024-09-29)

Copied from openQA Infrastructure (public) - action #166979: Grafana aka monitor.qa.suse.de reporting Bad Gateway error (Resolved, okurz, 2024-09-18)

Actions #1

Updated by livdywan 3 months ago

  • Copied from action #166979: Grafana aka monitor.qa.suse.de reporting Bad Gateway error added
Actions #2

Updated by nicksinger 3 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by nicksinger 3 months ago

  • Priority changed from High to Normal

Currently grafana is available again. Since I recently merged a commit and the deployment ran (and was fixed), I assume this correlates with the restart of grafana on the monitor host. I remember we had a similar discussion in the past and deemed it unfeasible to hot-reload grafana. I will check if something changed or if I can find an easy workaround.

Actions #4

Updated by okurz 3 months ago

  • Subject changed from Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again to Grafana aka monitor.qa.suse.de reporting Bad Gateway error - again size:S
  • Description updated (diff)
Actions #5

Updated by nicksinger 3 months ago

I found https://grafana.com/docs/grafana/latest/developers/http_api/admin/#reload-provisioning-configurations which could help in reloading grafana during deployments. I tried it on monitor with curl -X POST --unix-socket /var/run/grafana/grafana.socket http:/api/admin/settings but only got a 404 - I think this is related to not being authenticated. Since the upstream docs state multiple times that a "local user" is needed, I went ahead and tried to use "admin" (which should also have enough permissions to do all this). Unfortunately the password is lost, so I created https://gitlab.suse.de/openqa/password/-/merge_requests/18
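
A sketch of what the authenticated calls could look like once an admin password is in place (endpoint paths taken from the upstream docs, $GRAFANA_ADMIN_PW is just a placeholder; untested):

# reset the lost admin password locally; grafana-cli ships with the grafana package
grafana-cli admin reset-admin-password "$GRAFANA_ADMIN_PW"
# reload dashboard and datasource provisioning through the unix socket
curl -X POST -u "admin:$GRAFANA_ADMIN_PW" --unix-socket /var/run/grafana/grafana.socket http://localhost/api/admin/provisioning/dashboards/reload
curl -X POST -u "admin:$GRAFANA_ADMIN_PW" --unix-socket /var/run/grafana/grafana.socket http://localhost/api/admin/provisioning/datasources/reload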

Actions #6

Updated by nicksinger 3 months ago

next steps:

  • reset the admin password with the documented one
  • try if the api calls work with that user
  • implement a small script or something to hook into systemd to reload the service
  • add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
  • add a simple error_page 502-directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed
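
A minimal sketch of what the error_page part could look like (the page path and wording are assumptions, not a final config):

# in the nginx server block in front of grafana on monitor: serve a static page instead of the bare 502
error_page 502 /grafana_unavailable.html;
location = /grafana_unavailable.html {
    root /srv/www/htdocs;   # assumed location of the static maintenance page
    internal;
}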
Actions #7

Updated by openqa_review 3 months ago

  • Due date set to 2024-10-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by nicksinger 3 months ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-6:

next steps:

  • reset the admin password with the documented one
  • try if the api calls work with that user
  • implement a small script or something to hook into systemd to reload the service
  • add a timer to properly restart grafana nightly (reloads will eventually not catch everything I think).
  • add a simple error_page 502-directive (http://nginx.org/en/docs/http/ngx_http_core_module.html#) to our monitor host explaining on a simple, static page what to do if the nightly restart failed

In yesterday's daily we discussed a nightly restart and decided to only implement it if we realize that the other restarts (e.g. at reboot, after package updates) are not sufficient for us. I created a reload-script now and hooked it into our grafana service to make systemctl reload grafana-server possible. I'm not sure yet where to put the admin credentials but most likely they will end up either manually configured or in our pillars (have to cross-check if they are public). Also, nothing makes use of this reload function yet, so I have to go through our states and pipelines, find the places where we restart grafana and replace that with a reload-or-restart.
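
A minimal sketch of the drop-in that enables reload (the helper script name and path are placeholders, not the actual implementation):

# /etc/systemd/system/grafana-server.service.d/00-enable-reload.conf (sketch)
[Service]
# hypothetical helper calling the grafana admin provisioning reload endpoints via the unix socket
ExecReload=/usr/local/bin/grafana-reload-provisioning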

Actions #10

Updated by nicksinger 3 months ago

  • Status changed from Feedback to Workable

MR merged. But I still need to add the credentials for the admin user.

Actions #11

Updated by okurz 3 months ago

  • Related to action #167584: grafana-server on monitor.qe.nue2.suse.org yields "502 Bad Gateway", fails to start since 2024-09-28 03:57Z added
Actions #13

Updated by nicksinger 3 months ago

I had to follow up fixing two mistakes:

The service is now capable of handling a "reload" command. I guess we will see in the future if our changes get properly applied by Grafana's provisioning system. This should cover AC2. AC1 is covered implicitly by not showing the user an error at all during expected deployments (deployments now use "reload": https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1272/diffs#4878b8e726c5a6267ccc8ac5890a0280aad40e9d_219_229).
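
For illustration, in a Salt state this kind of change typically boils down to something like the following (a sketch with assumed file paths, not the actual diff from the MR):

grafana-server:
  service.running:
    - enable: True
    - reload: True    # reload instead of a full restart when watched files change
    - watch:
      - file: /etc/grafana/grafana.ini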

AC3 should be covered by: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1282
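
Judging by the drop-in 01-service-fail-mail.conf and the "Triggering OnFailure= dependencies" journal line further down, the alerting presumably hooks in roughly like this (a sketch, the notify unit name is an assumption):

# /etc/systemd/system/grafana-server.service.d/01-service-fail-mail.conf (sketch)
[Unit]
# start a mail-sending unit whenever grafana-server enters a failed state
OnFailure=status-mail@%n.service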

Actions #14

Updated by nicksinger 3 months ago

  • Status changed from Feedback to Resolved

All merged and deployed.

Actions #15

Updated by okurz 2 months ago

  • Due date changed from 2024-10-09 to 2024-10-25
  • Status changed from Resolved to Workable

13.10.2024 03:37:46 root root@monitor.qa.suse.de:

grafana-server.service on host monitor.qe.nue2.suse.org failed to start
please ssh into the host and check systemctl status grafana-server.service for potential reasons

As I have received more alert messages afterwards, I assume this was transient and we should look into this message to avoid it in the future.

Actions #16

Updated by nicksinger 2 months ago

  • Status changed from Workable to Feedback
Actions #17

Updated by okurz 2 months ago

merged and deployed. I triggered a reboot of monitor.

Actions #18

Updated by nicksinger 2 months ago

  • Status changed from Feedback to Resolved
root@monitor:~ # uptime
 18:49:56  up   2:34,  1 user,  load average: 1.13, 1.25, 1.34
root@monitor:~ # systemctl status grafana-server
● grafana-server.service - Grafana instance
     Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/grafana-server.service.d
             └─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
     Active: active (running) since Tue 2024-10-15 16:16:57 CEST; 2h 33min ago

reboot was conducted and the service came up without alerting us by mail. I purposely broke /etc/grafana/grafana.ini and restarted the service which failed and logged:

root@monitor:~ # systemctl restart grafana-server
Job for grafana-server.service failed because the control process exited with error code.
See "systemctl status grafana-server.service" and "journalctl -xeu grafana-server.service" for details.
root@monitor:~ # systemctl status grafana-server
× grafana-server.service - Grafana instance
     Loaded: loaded (/usr/lib/systemd/system/grafana-server.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/grafana-server.service.d
             └─00-enable-reload.conf, 01-service-fail-mail.conf, override.conf
     Active: failed (Result: exit-code) since Tue 2024-10-15 18:52:58 CEST; 15s ago
   Duration: 2h 35min 51.622s
       Docs: http://docs.grafana.org
    Process: 15139 ExecStart=/usr/share/grafana/bin/grafana server --config=${CONF_FILE} --pidfile=${PID_FILE_DIR}/grafana-server.pid --packaging=rpm cfg:default.paths.logs=${LOG_DIR} cfg:default.paths.data=${DATA_DIR} cfg:default.paths.plugins=${PLUGINS_DIR} cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR} (>
   Main PID: 15139 (code=exited, status=1/FAILURE)
        CPU: 506ms

Oct 15 18:52:56 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 4.
Oct 15 18:52:56 monitor systemd[1]: Starting Grafana instance...
Oct 15 18:52:57 monitor grafana[15139]: logger=settings t=2024-10-15T18:52:57.896695559+02:00 level=error msg="failed to parse \"/etc/grafana/grafana.ini\": key-value delimiter not found: NSINGER BROKE THIS _ DELETE ME IF YOU FIND ME AND RESTART GRAFANA\n"
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 18:52:57 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Start request repeated too quickly.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Oct 15 18:52:58 monitor systemd[1]: Failed to start Grafana instance.
Oct 15 18:52:58 monitor systemd[1]: grafana-server.service: Triggering OnFailure= dependencies.

And I received an e-mail. So it alerts on "real" issues but keeps silent on reboots.
