action #126212
openqa.suse.de response times very slow. No alert fired size:M (closed)
Description
Observation
Tina observed very slow responses from the OSD webui at 10:33 CET. Shortly after, we were asked about it in #eng-testing.
The higher load is also clearly visible in Grafana: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314
As far as I can tell, we received no Apache response time alerts.
Acceptance Criteria
- AC1: It is known that our alert thresholds are sensible
Suggestions
- Check what caused the high load, e.g. by analyzing the Apache logs in /var/log/apache2 (see the sketch after this list)
- Remediate the offender (e.g. fix a script, block an IP, etc.)
- Check why the Apache response time alert did not fire and whether something needs to be fixed
  - Should the "Apache Response Time" alert have fired?
  - Maybe the alert was too relaxed and didn't trigger "yet"?
  - The threshold should be 10s, but even the index page without additional AJAX took longer? We don't have numbers, though.
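A minimal sketch of the kind of analysis meant in the first suggestion, assuming the default combined log format and log files under /var/log/apache2 (file names, paths and the client address below are placeholders, not taken from this incident):
# Top 10 clients by request count (field 1 is the client address)
awk '{print $1}' /var/log/apache2/access_log | sort | uniq -c | sort -n | tail -n 10
# Top 10 requested paths for one suspicious client, query strings stripped
# (10.0.0.1 is only a placeholder address for illustration)
awk '$1 == "10.0.0.1" {print $7}' /var/log/apache2/access_log | cut -d '?' -f 1 | sort | uniq -c | sort -n | tail -n 10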
Updated by nicksinger almost 2 years ago
openqa:/var/log/apache2 # cat access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
4750 10.84.226.182
4776 2620:113:80c0:8340::10ac
5771 2620:113:80c0:8360::10fb
5789 10.163.42.50
6397 10.100.12.157
7835 10.84.227.14
15051 10.162.191.54
15443 10.163.29.202
30227 10.67.20.191
58676 10.162.191.50
(logs start at [20/Mar/2023:00:36:00 +0100])
openqa:/var/log/apache2 # cat openqa.access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
22274 2a07:de40:a203:12:56ab:3aff:fe16:ddc4
25917 10.100.96.68
26375 2a07:de40:a203:12:ec4:7aff:fe7a:789e
29865 2a07:de40:a203:12:56ab:3aff:fe24:358d
33131 2620:113:80c0:80a0:10:162:2a:42f0
39793 10.100.96.76
44405 2a07:de40:a203:12:ec4:7aff:fe99:dcd9
47209 10.100.96.74
55214 10.100.96.72
98071 2a07:de40:a203:12:2e60:cff:fe73:2ac
(logs start at [20/Mar/2023:08:00:02 +0100])
10.162.191.50 is gitlab-worker3.suse.de, and 2a07:de40:a203:12:2e60:cff:fe73:2ac is worker2.oqa.suse.de.
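For reference, a sketch of how such addresses can be mapped to hostnames, assuming PTR records exist and the "host" utility (bind-utils) is available:
# Reverse DNS lookups for the two busiest clients seen above
host 10.162.191.50
host 2a07:de40:a203:12:2e60:cff:fe73:2ac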
Updated by nicksinger almost 2 years ago
51721 of the 58676 requests made from 10.162.191.50 are to /api/v1/jobs (determined with cat access_log | grep 10.162.191.50 | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g). These requests came from the following user agents:
openqa:/var/log/apache2 # cat access_log | grep 10.162.191.50 | grep "/api/v1/jobs" | rev | cut -d \" -f 2 | rev | sort | uniq -c
2369 Mojolicious (Perl)
1866 openqa-review (https://os-autoinst.github.io/openqa_review)
50897 python-requests/2.24.0
72 python-requests/2.28.2
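As a side note, the rev/cut/rev trick above extracts the last quoted field of each log line (the user agent). An equivalent sketch using awk with the double quote as field separator would be (not what was actually run, just an alternative):
# $2 is the request line, $6 the user agent in the combined log format
awk -F'"' '$1 ~ /^10\.162\.191\.50 / && $2 ~ /\/api\/v1\/jobs/ {print $6}' access_log | sort | uniq -c | sort -g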
Updated by nicksinger almost 2 years ago
openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10
1945 /api/v1/jobs/10730678/artefact
1972 /api/v1/jobs/10730677/artefact
2189 /api/v1/jobs/10730665/artefact
2194 /api/v1/jobs/10730666/artefact
2529 /api/v1/jobs/10730675/artefact
3354 /api/v1/jobs/10730532/status
3796 /api/v1/jobs/10730631/status
11900 /api/v1/jobs/10730532/artefact
15475 /api/v1/jobs/10730754/artefact
16832 /api/v1/jobs/10730540/artefact
openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10 | cut -d "/" -f 5 | xargs -I{} echo "https://openqa.suse.de/tests/{}"
https://openqa.suse.de/tests/10730678
https://openqa.suse.de/tests/10730677
https://openqa.suse.de/tests/10730665
https://openqa.suse.de/tests/10730666
https://openqa.suse.de/tests/10730675
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730631
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730754
https://openqa.suse.de/tests/10730540
Updated by nicksinger almost 2 years ago
There was quite a high peak in outgoing network traffic in that time frame: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679303466388&to=1679305843683 - showing 10.162.191.50 with ~3k requests but worker2.oqa.suse.de with a whopping ~16k requests. Not sure whether those requests are heavy, though.
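Whether the requests are heavy could be estimated from the response sizes logged by Apache. A sketch assuming the default combined log format, where the 10th whitespace-separated field is the response body size in bytes (lines with "-" as size are skipped):
# Sum of response bytes per client, top 10
awk '$10 ~ /^[0-9]+$/ {bytes[$1] += $10} END {for (ip in bytes) print bytes[ip], ip}' openqa.access_log | sort -n | tail -n 10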
Updated by okurz over 1 year ago
- Tags set to infra, reactive work
- Target version set to Ready
Updated by okurz over 1 year ago
- Related to action #124649: Spotty responses from OSD 2023-02-15 added
Updated by okurz over 1 year ago
- Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Updated by okurz over 1 year ago
- Related to action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Updated by okurz over 1 year ago
- Related to coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Updated by livdywan over 1 year ago
- Subject changed from openqa.suse.de response times very slow. No alert fired to openqa.suse.de response times very slow. No alert fired size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 1 year ago
- Status changed from Workable to Feedback
The alert threshold visible on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=84&orgId=1&editPanel=84&tab=alert&from=1679293281000&to=1679306017000 looks reasonable. The alert would have fired if the condition had persisted for 30 minutes (but it didn't). I could reduce this to 15 minutes or so, but our users would likely still notice and report the problem faster than that, so there's no real gain. For a real gain we'd likely have to reduce it to 5 minutes or even lower. However, that would likely re-introduce the false alerts that @okurz previously tried to fix via https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/52563c98d02158b508773386b66d18787a0237a6. So I tend to not change anything at all and consider AC1 fulfilled with that explanation.
Note that the dashboard linked in the description (https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314&viewPanel=84&editPanel=84&tab=query) has no alert but that's because it isn't the provisioned version of the dashboard/panel.
Updated by okurz over 1 year ago
mkittler wrote:
So I tend to not change anything at all and consider AC1 fulfilled with that explanation.
Then I suggest trying to better understand what caused the problem and what we can do to prevent it in the future.
Updated by mkittler over 1 year ago
What do you expect me to do? No logs from that time have been preserved.
I've just looked at some of the other graphs for that time frame but haven't seen anything very conclusive. The only thing we can say for sure is that the CPU was not very busy around that time. The CPU usage was in fact lower than normal, indicating that requests were served slowly because processes were waiting on something else, e.g. I/O.
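A sketch of how I/O pressure could be cross-checked for such an incident, assuming the sysstat collector is enabled on the host and keeps its daily archives under /var/log/sa (both the path and the retention are configuration dependent, so this may simply not be available):
# CPU utilization including %iowait for the day of the incident (the 20th)
sar -u -f /var/log/sa/sa20
# Per-device I/O statistics for the same day
sar -d -f /var/log/sa/sa20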
Updated by okurz over 1 year ago
Yes, that should suffice as an explanation. But why are no logs preserved when the issue happened only 29 days ago?
Updated by mkittler over 1 year ago
I've only checked the apache2 logs and the journal. I can try to check further logs, but finding the right spot in e.g. the openQA logs is a bit of an effort.
Updated by mkittler over 1 year ago
Looks like there are really no openQA logs from that date except for GRU logs, as martchus@openqa:~> sudo xzgrep '2023-03-20' /var/log/openqa* only shows occurrences in /var/log/openqa_gru.1.xz to /var/log/openqa_gru.3.xz.
This is actually expected because the main openQA service logs into the journal. There are older logs, but those are from before switching to the journal.
Why the journal itself doesn't go back further I don't know. The size constraints would definitely leave room for more (both the percentage and the total size; for the percentage it also wouldn't matter whether systemd is clever enough to use the percentage of /srv as opposed to /). None of the journal services (systemd-journald.service and systemd-journal-flush.service) show any problems.
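For reference, a sketch of the standard systemd commands for checking the journal's coverage, size and configured limits (nothing openQA-specific):
# Disk space currently used by the journal
journalctl --disk-usage
# Boots (and thus roughly the time range) still covered by the journal
journalctl --list-boots
# Any explicitly configured size/retention limits
grep -E 'SystemMaxUse|SystemKeepFree|MaxRetentionSec' /etc/systemd/journald.conf /etc/systemd/journald.conf.d/*.conf 2>/dev/null
# Health of the journal services mentioned above
systemctl status systemd-journald.service systemd-journal-flush.service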
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
Let's handle the underlying issue as part of #127985 as it is likely the same issue anyway and we don't have much to go on for this occurrence. We also know why the alert didn't fire, but I don't think we need to change the grace period at this point.
Updated by okurz over 1 year ago
- Related to action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added