action #158808

Updated by okurz about 1 month ago

## Motivation
See #158550 and #158556. We introduced an alert based on 5xx HTTP responses and unexpectedly found that we have about 120 5xx HTTP responses every hour. We should identify why we have so many hits, fix the problem in either openQA behaviour or the bug in the monitoring data, and then reduce the alert threshold accordingly.

## Acceptance criteria
* **AC1:** The number of HTTP 5xx errors https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1146/diffs#e5dc019b71ceec2f7e2df28d3e06d72a110fb6b0_1648_1745 is reasonably low, significantly below 200
* **AC2:** We know how many 500 errors we actually have (so our monitoring doesn't fool us)

## Suggestions
* On OSD run `grep '" \<500\> ' /var/log/apache2/access_log`, which right now looks like this:

```
10.149.213.14 - - [10/Apr/2024:03:39:24 +0200] "GET /liveviewhandler/tests/13971445/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 363
2a07:de40:b2bf:1b::1117 - - [10/Apr/2024:08:47:18 +0200] "GET /liveviewhandler/tests/13991767/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36" 5079
10.149.213.14 - - [10/Apr/2024:11:52:14 +0200] "GET /liveviewhandler/tests/13993065/developer/ws-proxy/status HTTP/1.1" 500 - "-" "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" 586
2a07:de40:b203:12:7ec2:55ff:fe24:de70 - - [10/Apr/2024:13:17:33 +0200] "POST /api/v1/mutex/support_server_ready?action=lock HTTP/1.1" 500 860 "-" "Mojolicious (Perl)" 2818
```
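
To see which endpoints contribute most of these 500 responses, the matches can be aggregated by request path, e.g. with the following sketch (the field position is an assumption based on the log format shown above):

```sh
# Count HTTP 500 responses per route prefix in the Apache access log.
# $7 is the request path in the log format shown above; adjust if the format differs.
grep '" \<500\> ' /var/log/apache2/access_log \
  | awk '{print $7}' \
  | cut -d/ -f1-3 \
  | sort | uniq -c | sort -rn | head
```

This should quickly show whether most hits come from the /liveviewhandler routes visible in the excerpt or from something else.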

* Also https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=80 shows only very few 5xx responses, so likely something is counted wrongly in the alert query on https://monitor.qa.suse.de/alerting/grafana/d949dbae-8034-4bf4-8418-f148dfcaf89d/view?returnTo=%2Fd%2FWebuiDb%2Fwebui-summary%3ForgId%3D1%26viewPanel%3D80%26editPanel%3D80%26tab%3Dalert . It seems we are counting zero values and should only count non-zero values, or adjust the query to only return 500 responses
* Fix possible mistakes in the alert and panel queries, see the query sketch after this list
* Prevent the 500 errors seen above and fix our monitoring accordingly
* After the value ending up in Grafana is reduced, adjust the alert threshold to a lower, sensible value
* Confirm that the HTTP status code tracking is correct, or whether it needs to be fixed first: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1146/diffs#e5dc019b71ceec2f7e2df28d3e06d72a110fb6b0_1648_1745
* Create follow-up tickets for fixing non-trivial causes of 500 responses
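
For the alert and panel query fixes, a hedged sketch of how the stored data could be cross-checked directly in InfluxDB; the database, measurement and field names (`telegraf`, `apache_status_codes`, `hits_500`) are assumptions and need to be verified against the telegraf configuration in salt-states-openqa:

```sh
# Hypothetical cross-check of the alert query against InfluxDB.
# Database, measurement and field names are assumptions, not the verified telegraf schema.

# What the alert seems to do now: count() counts every sample, including zero-valued ones.
influx -database telegraf -execute \
  'SELECT count("hits_500") FROM "apache_status_codes" WHERE time > now() - 1h'

# What we likely want: sum() adds up the actual 500 responses, so zero samples do not inflate the result.
influx -database telegraf -execute \
  'SELECT sum("hits_500") FROM "apache_status_codes" WHERE time > now() - 1h'
```

Comparing both numbers with the grep count from the access_log should show whether the alert query or the data collection is at fault.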
