action #126212


openqa.suse.de response times very slow. No alert fired size:M

Added by nicksinger over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: Ready
Start date: 2023-03-20
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

Tina observed very slow responses from the OSD web UI at 10:33 CET. Shortly after, we were asked about it in #eng-testing.
The higher load is clearly visible in Grafana too: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314
As far as I can tell, we received no Apache response time alerts.

Acceptance Criteria

  • AC1: It is known that our alert thresholds are sensible

Suggestions

  • Check what caused the high load, e.g. by analyzing the Apache logs in /var/log/apache2 (a minimal sketch follows this list)
  • Remediate the offender (e.g. by fixing a script, blocking an IP, etc.)
  • Check why the Apache response time alert did not fire and whether something needs to be fixed
    • Should the "Apache Response Time" alert have fired?
    • Maybe the alert was too relaxed and simply hadn't triggered "yet"?
    • The threshold should be 10s, and even the index page without additional AJAX took longer than that, but we don't have concrete numbers.
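
A minimal sketch of such an analysis, assuming the default Apache "combined" log format (the commands actually used are recorded in the comments below):

# top requesters by request count
awk '{print $1}' /var/log/apache2/access_log | sort | uniq -c | sort -rn | head
# most-requested routes, query strings stripped
awk -F'"' '{print $2}' /var/log/apache2/access_log | awk '{print $2}' | cut -d '?' -f 1 | sort | uniq -c | sort -rn | head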

Related issues 5 (1 open, 4 closed)

Related to openQA Project - action #124649: Spotty responses from OSD 2023-02-15 (Resolved, livdywan, 2023-02-15)

Related to openQA Infrastructure - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Resolved, mkittler, 2022-09-18)

Related to openQA Project - action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (Resolved, okurz, 2022-06-22)

Related to openQA Project - coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (New, 2022-06-22)

Related to openQA Infrastructure - action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M (Resolved, okurz, 2023-04-01)
Actions #1

Updated by nicksinger over 1 year ago

openqa:/var/log/apache2 # cat access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
   4750 10.84.226.182
   4776 2620:113:80c0:8340::10ac
   5771 2620:113:80c0:8360::10fb
   5789 10.163.42.50
   6397 10.100.12.157
   7835 10.84.227.14
  15051 10.162.191.54
  15443 10.163.29.202
  30227 10.67.20.191
  58676 10.162.191.50

(logs start at [20/Mar/2023:00:36:00 +0100])

openqa:/var/log/apache2 # cat openqa.access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
  22274 2a07:de40:a203:12:56ab:3aff:fe16:ddc4
  25917 10.100.96.68
  26375 2a07:de40:a203:12:ec4:7aff:fe7a:789e
  29865 2a07:de40:a203:12:56ab:3aff:fe24:358d
  33131 2620:113:80c0:80a0:10:162:2a:42f0
  39793 10.100.96.76
  44405 2a07:de40:a203:12:ec4:7aff:fe99:dcd9
  47209 10.100.96.74
  55214 10.100.96.72
  98071 2a07:de40:a203:12:2e60:cff:fe73:2ac

(logs start at [20/Mar/2023:08:00:02 +0100])

10.162.191.50 is gitlab-worker3.suse.de and 2a07:de40:a203:12:2e60:cff:fe73:2ac is worker2.oqa.suse.de
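
The hostname mapping can presumably be reproduced via reverse DNS, e.g.:

# resolve the two biggest requesters back to hostnames
host 10.162.191.50
host 2a07:de40:a203:12:2e60:cff:fe73:2ac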

Actions #2

Updated by nicksinger over 1 year ago

51721 of the 58676 requests from 10.162.191.50 went to /api/v1/jobs (counted via: cat access_log | grep 10.162.191.50 | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g). These requests came from the following user agents:

openqa:/var/log/apache2 # cat access_log | grep 10.162.191.50 | grep "/api/v1/jobs" | rev | cut -d \" -f 2 | rev | sort | uniq -c
   2369 Mojolicious (Perl)
   1866 openqa-review (https://os-autoinst.github.io/openqa_review)
  50897 python-requests/2.24.0
     72 python-requests/2.28.2
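
A possible follow-up (hypothetical, not part of the original analysis) would be to check how the requests from the dominating python-requests client are distributed over time, to distinguish a steady poller from a burst:

# requests per minute from the suspect client; cutting the first three
# ":"-separated fields after "[" yields day/month/year:hour:minute
grep 10.162.191.50 access_log | grep 'python-requests/2.24.0' \
  | cut -d '[' -f 2 | cut -d ':' -f 1-3 | sort | uniq -c | sort -g | tail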
Actions #3

Updated by nicksinger over 1 year ago

openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10
   1945 /api/v1/jobs/10730678/artefact
   1972 /api/v1/jobs/10730677/artefact
   2189 /api/v1/jobs/10730665/artefact
   2194 /api/v1/jobs/10730666/artefact
   2529 /api/v1/jobs/10730675/artefact
   3354 /api/v1/jobs/10730532/status
   3796 /api/v1/jobs/10730631/status
  11900 /api/v1/jobs/10730532/artefact
  15475 /api/v1/jobs/10730754/artefact
  16832 /api/v1/jobs/10730540/artefact
openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10 | cut -d "/" -f 5 | xargs -I{} echo "https://openqa.suse.de/tests/{}"

https://openqa.suse.de/tests/10730678
https://openqa.suse.de/tests/10730677
https://openqa.suse.de/tests/10730665
https://openqa.suse.de/tests/10730666
https://openqa.suse.de/tests/10730675
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730631
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730754
https://openqa.suse.de/tests/10730540

Actions #4

Updated by nicksinger over 1 year ago

There was quite a high peak in outgoing network traffic in that time frame: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679303466388&to=1679305843683 - it shows 10.162.191.50 with ~3k requests but worker2.oqa.suse.de with a whopping ~16k requests; not sure if those requests are heavy, though.

Actions #5

Updated by okurz over 1 year ago

  • Tags set to infra, reactive work
  • Target version set to Ready
Actions #6

Updated by okurz over 1 year ago

  • Related to action #124649: Spotty responses from OSD 2023-02-15 added
Actions #7

Updated by okurz over 1 year ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Actions #8

Updated by okurz over 1 year ago

  • Related to action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Actions #9

Updated by okurz over 1 year ago

  • Related to coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Actions #10

Updated by livdywan over 1 year ago

  • Subject changed from openqa.suse.de response times very slow. No alert fired to openqa.suse.de response times very slow. No alert fired size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #12

Updated by mkittler over 1 year ago

  • Status changed from Workable to Feedback

The alert threshold visible on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=84&orgId=1&editPanel=84&tab=alert&from=1679293281000&to=1679306017000 looks reasonable. The alert would have fired if the condition had persisted for 30 minutes (but it didn't). I could reduce this to 15 minutes or so, but our users would likely still notice the problem faster than that, so there's no real gain. To beat them we'd have to reduce it to 5 minutes or even lower. However, that would likely re-introduce the false alerts that @okurz previously tried to fix via https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/52563c98d02158b508773386b66d18787a0237a6. So I tend to not change anything at all and consider AC1 fulfilled with that explanation.


Note that the dashboard linked in the description (https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314&viewPanel=84&editPanel=84&tab=query) has no alert but that's because it isn't the provisioned version of the dashboard/panel.
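
If one wanted to inspect or adjust that 30-minute grace period, it should be defined somewhere in the provisioned dashboards in salt-states-openqa; a hypothetical way to locate it (the repository layout and file format are assumptions, not verified):

# fetch the salt states and search for the alert's "for" duration
git clone https://gitlab.suse.de/openqa/salt-states-openqa.git
grep -rin '"for"' salt-states-openqa | grep -i apache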

Actions #13

Updated by okurz over 1 year ago

mkittler wrote:

So I tend to not change anything at all and consider AC1 fulfilled with that explanation.

Then I suggest trying to understand better what caused the problem and what we can do to prevent it in the future.

Actions #14

Updated by mkittler over 1 year ago

What do you expect me to do? No logs from that time have been preserved.

I've just looked at some of the other graphs for that time frame but haven't seen anything very conclusive. The only thing we can say for sure is that the CPU was not very busy around that time. The CPU usage was in fact lower than normal, indicating that requests were served slowly because processes were waiting for something else, e.g. I/O.

Actions #15

Updated by okurz over 1 year ago

Yes, that should suffice as an explanation. But why were no logs preserved when the issue happened just 29 days ago?

Actions #16

Updated by mkittler over 1 year ago

I've only checked the apache2 logs and the journal. I can try to check further logs, but finding the right spot in e.g. the openQA logs is a bit of an effort.

Actions #17

Updated by mkittler over 1 year ago

Looks like there are really no openQA logs from that date except for GRU logs: martchus@openqa:~> sudo xzgrep '2023-03-20' /var/log/openqa* only shows occurrences in /var/log/openqa_gru.1.xz to /var/log/openqa_gru.3.xz.

This is actually expected because the main openQA service logs into the journal. There are actually older logs but those are from before switching to the journal.

Why the journal itself doesn't go back further I don't know. The size constraints would definitely leave room for more (both the percentage and the total size; for the percentage it also wouldn't matter whether systemd is clever enough to use the percentage of /srv as opposed to /). None of the journal services (systemd-journald.service and systemd-journal-flush.service) show any problems.
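
For the record, a few standard systemd commands one might use to double-check journal retention (not from the original comment):

# current disk usage of all journal files
journalctl --disk-usage
# the oldest recorded boots show how far back the journal goes
journalctl --list-boots | head -n 3
# configured size constraints such as SystemMaxUse= and SystemKeepFree=
grep -v '^#' /etc/systemd/journald.conf
# health of the journal services mentioned above
systemctl status systemd-journald.service systemd-journal-flush.service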

Actions #18

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

Let's handle the underlying issue as part of #127985 as it is likely the same issue anyway and we don't have much to go on for this occurrence. We also know why the alert didn't fire, but I don't think we need to change the grace period at this point.

Actions #19

Updated by okurz over 1 year ago

  • Related to action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added