action #126212 (closed)

openqa.suse.de response times very slow. No alert fired size:M

Added by nicksinger about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: Ready
Start date: 2023-03-20
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

Tina observed very slow responses from the OSD webui at 10:33 CET. Shortly afterwards we were asked about it in #eng-testing.
The higher load is also clearly visible in Grafana: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314
As far as I can tell, we received no Apache response time alert.

Acceptance Criteria

  • AC1: It is known that our alert thresholds are sensible

Suggestions

  • Check what caused the high load, e.g. by analyzing the Apache logs in /var/log/apache2 (see the sketch after this list)
  • Remediate the offender (e.g. by fixing a script, blocking an IP, etc.)
  • Check why the Apache response time alert did not fire and whether something needs to be fixed
    • Should "Apache Response Time" have fired?
    • Maybe the alert threshold was too relaxed and simply didn't trigger "yet"?
    • The threshold should be 10s, but even the index page without additional AJAX requests took longer? We don't have numbers, though.
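
A minimal sketch of such an analysis, assuming the default Apache combined log format (client address as the first whitespace-separated field, the request line as the second double-quote-delimited field):

# Top 10 clients by request count
awk '{print $1}' /var/log/apache2/access_log | sort | uniq -c | sort -n | tail -n 10
# Top 10 requested paths, query strings stripped
awk -F '"' '{split($2, r, " "); sub(/\?.*/, "", r[2]); print r[2]}' /var/log/apache2/access_log | sort | uniq -c | sort -n | tail -n 10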

Related issues: 5 (1 open, 4 closed)

  • Related to openQA Project - action #124649: Spotty responses from OSD 2023-02-15 (Resolved, livdywan, 2023-02-15)
  • Related to openQA Infrastructure - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Resolved, mkittler, 2022-09-18)
  • Related to openQA Project - action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (Resolved, okurz, 2022-06-22)
  • Related to openQA Project - coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" (New, 2022-06-22)
  • Related to openQA Infrastructure - action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M (Resolved, okurz, 2023-04-01)
Actions #1

Updated by nicksinger about 1 year ago

openqa:/var/log/apache2 # cat access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
   4750 10.84.226.182
   4776 2620:113:80c0:8340::10ac
   5771 2620:113:80c0:8360::10fb
   5789 10.163.42.50
   6397 10.100.12.157
   7835 10.84.227.14
  15051 10.162.191.54
  15443 10.163.29.202
  30227 10.67.20.191
  58676 10.162.191.50

(logs start at [20/Mar/2023:00:36:00 +0100])

openqa:/var/log/apache2 # cat openqa.access_log | cut -d "-" -f 1 | sort | uniq -c | sort -g | tail -n 10
  22274 2a07:de40:a203:12:56ab:3aff:fe16:ddc4
  25917 10.100.96.68
  26375 2a07:de40:a203:12:ec4:7aff:fe7a:789e
  29865 2a07:de40:a203:12:56ab:3aff:fe24:358d
  33131 2620:113:80c0:80a0:10:162:2a:42f0
  39793 10.100.96.76
  44405 2a07:de40:a203:12:ec4:7aff:fe99:dcd9
  47209 10.100.96.74
  55214 10.100.96.72
  98071 2a07:de40:a203:12:2e60:cff:fe73:2ac

(logs start at [20/Mar/2023:08:00:02 +0100])

10.162.191.50 is gitlab-worker3.suse.de and 2a07:de40:a203:12:2e60:cff:fe73:2ac is worker2.oqa.suse.de.
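
For reference, a stage-by-stage breakdown of the pipeline above (a sketch; cut -d "-" -f 1 relies on the " - " separator that follows the client address in the default log format, which works here because neither IPv4 nor IPv6 addresses contain a "-"):

# cut -d "-" -f 1  -> everything before the first "-", i.e. the client address
# sort | uniq -c   -> count how often each address occurs
# sort -g          -> order by count, ascending
# tail -n 10       -> keep the ten busiest clients
cut -d "-" -f 1 access_log | sort | uniq -c | sort -g | tail -n 10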

Actions #2

Updated by nicksinger about 1 year ago

51721 of the 58676 requests from 10.162.191.50 went to /api/v1/jobs (cat access_log | grep 10.162.191.50 | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g). These requests came from the following user agents:

openqa:/var/log/apache2 # cat access_log | grep 10.162.191.50 | grep "/api/v1/jobs" | rev | cut -d \" -f 2 | rev | sort | uniq -c
   2369 Mojolicious (Perl)
   1866 openqa-review (https://os-autoinst.github.io/openqa_review)
  50897 python-requests/2.24.0
     72 python-requests/2.28.2
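
The same breakdown can be done in a single awk pass instead of the rev/cut/rev trick; a sketch assuming the default combined log format, where splitting on double quotes makes $2 the request line and $6 the user agent:

awk -F '"' '/^10\.162\.191\.50 / && $2 ~ /\/api\/v1\/jobs/ {print $6}' access_log | sort | uniq -c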
Actions #3

Updated by nicksinger about 1 year ago

openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10
   1945 /api/v1/jobs/10730678/artefact
   1972 /api/v1/jobs/10730677/artefact
   2189 /api/v1/jobs/10730665/artefact
   2194 /api/v1/jobs/10730666/artefact
   2529 /api/v1/jobs/10730675/artefact
   3354 /api/v1/jobs/10730532/status
   3796 /api/v1/jobs/10730631/status
  11900 /api/v1/jobs/10730532/artefact
  15475 /api/v1/jobs/10730754/artefact
  16832 /api/v1/jobs/10730540/artefact
openqa:/var/log/apache2 # cat openqa.access_log | grep 2a07:de40:a203:12:2e60:cff:fe73:2ac | cut -d \" -f 2 | cut -d " " -f 2 | cut -d "?" -f 1 | sort | uniq -c | sort -g | tail -n10 | cut -d "/" -f 5 | xargs -I{} echo "https://openqa.suse.de/tests/{}"

https://openqa.suse.de/tests/10730678
https://openqa.suse.de/tests/10730677
https://openqa.suse.de/tests/10730665
https://openqa.suse.de/tests/10730666
https://openqa.suse.de/tests/10730675
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730631
https://openqa.suse.de/tests/10730532
https://openqa.suse.de/tests/10730754
https://openqa.suse.de/tests/10730540
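
10730532 shows up twice above because both its /status and its /artefact route made the top 10. A sketch that aggregates all routes per job ID before building the URLs:

# Extract the job ID (5th "/"-separated field of /api/v1/jobs/<id>/...),
# count requests per job and print the ten busiest jobs as test URLs
grep 2a07:de40:a203:12:2e60:cff:fe73:2ac openqa.access_log | cut -d \" -f 2 | cut -d " " -f 2 \
  | grep -o '^/api/v1/jobs/[0-9]*' | cut -d "/" -f 5 | sort | uniq -c | sort -g | tail -n 10 \
  | awk '{print "https://openqa.suse.de/tests/" $2}'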

Actions #4

Updated by nicksinger about 1 year ago

There was quite a high peak in outgoing network traffic in that time frame: https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679303466388&to=1679305843683 - it shows 10.162.191.50 with ~3k requests but worker2.oqa.suse.de with a whopping ~16k requests. I am not sure how heavy those requests are, though (a rough way to estimate this from the logs is sketched below).
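
Request counts alone don't tell how heavy the requests were; summing the response sizes per client would. A sketch, assuming the default combined log format where the response size is the 10th whitespace-separated field:

# Sum response bytes per client address and show the top 10
# (a "-" size coerces to 0 in awk's numeric context)
awk '{bytes[$1] += $10} END {for (ip in bytes) print bytes[ip], ip}' openqa.access_log | sort -n | tail -n 10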

Actions #5

Updated by okurz about 1 year ago

  • Tags set to infra, reactive work
  • Target version set to Ready
Actions #6

Updated by okurz about 1 year ago

  • Related to action #124649: Spotty responses from OSD 2023-02-15 added
Actions #7

Updated by okurz about 1 year ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
Actions #8

Updated by okurz about 1 year ago

  • Related to action #112859: Conduct Five Whys for "[alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Actions #9

Updated by okurz about 1 year ago

  • Related to coordination #112961: [epic] Followup to "openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend" added
Actions #10

Updated by livdywan about 1 year ago

  • Subject changed from openqa.suse.de response times very slow. No alert fired to openqa.suse.de response times very slow. No alert fired size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #11

Updated by mkittler about 1 year ago

  • Assignee set to mkittler
Actions #12

Updated by mkittler about 1 year ago

  • Status changed from Workable to Feedback

The alert threshold visible on https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=84&orgId=1&editPanel=84&tab=alert&from=1679293281000&to=1679306017000 looks reasonable. The alert would have fired if the condition had persisted for 30 minutes (but it didn't). I could reduce this to 15 minutes or so, but our users would likely still notice and report the problem faster than that, so there's no real gain. To beat them we'd have to reduce it to 5 minutes or even lower. However, that would likely re-introduce the false alerts that @okurz previously tried to fix via https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/52563c98d02158b508773386b66d18787a0237a6. So I tend not to change anything at all and consider AC1 fulfilled with that explanation.


Note that the dashboard linked in the description (https://stats.openqa-monitor.qa.suse.de/d/Webuinew/webui-summary-new?orgId=1&from=1679293281205&to=1679306017314&viewPanel=84&editPanel=84&tab=query) has no alert, but that's because it isn't the provisioned version of the dashboard/panel.
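
For illustration, the grace period lives in the provisioned panel's legacy alert definition, roughly like this (a sketch following Grafana's legacy alerting schema; the actual names and values in salt-states-openqa may differ):

"alert": {
  "name": "Apache Response Time alert",
  "frequency": "1m",
  "for": "30m",
  "conditions": [{
    "type": "query",
    "query": { "params": ["A", "5m", "now"] },
    "reducer": { "type": "avg", "params": [] },
    "evaluator": { "type": "gt", "params": [10] }
  }]
}

Reducing "for" from "30m" to "5m" is the trade-off discussed above: faster detection at the cost of re-introducing false alerts on short spikes.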

Actions #13

Updated by okurz about 1 year ago

mkittler wrote:

So I tend to not change anything at all and consider AC1 fulfilled with that explanation.

Then I suggest trying to better understand what caused the problem and what we can do to prevent it in the future.

Actions #14

Updated by mkittler about 1 year ago

What do you expect me to do? No logs from that time have been preserved.

I've just looked at some of the other graphs covering that time frame but haven't seen anything conclusive. The only thing we can say for sure is that the CPU was not very busy; its usage was in fact lower than normal, indicating that requests were served slowly because processes were waiting on something else, e.g. I/O (see the sketch below for ways to confirm this next time).
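
A few quick checks could confirm the I/O-wait theory while a future incident is ongoing (a sketch; iostat requires sysstat, /proc/pressure a kernel with PSI support):

vmstat 5               # "wa" column: share of CPU time spent waiting on I/O
iostat -x 5            # per-device utilization and average wait times
cat /proc/pressure/io  # PSI: share of wall time tasks were stalled on I/O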

Actions #15

Updated by okurz about 1 year ago

Yes, that should suffice as an explanation. But why are no logs preserved when the issue happened only 29 days ago?

Actions #16

Updated by mkittler about 1 year ago

I've only checked the Apache logs and the journal. I can try to check further logs, but finding the right spot in e.g. the openQA logs is a bit of an effort.

Actions #17

Updated by mkittler about 1 year ago

Looks like there are really no openQA logs from that date except for GRU logs: martchus@openqa:~> sudo xzgrep '2023-03-20' /var/log/openqa* only shows occurrences in /var/log/openqa_gru.1.xz to /var/log/openqa_gru.3.xz.

This is actually expected because the main openQA service logs to the journal. There are older logs on disk, but those are from before the switch to the journal.

Why the journal itself doesn't go back further I don't know. The size constraints would definitely leave room for more (both in percentage and in total size; for the percentage it also wouldn't matter whether systemd is clever enough to use the percentage of /srv as opposed to /). None of the journal services (systemd-journald.service and systemd-journal-flush.service) show any problems.
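
Some journald checks that could narrow down the retention question (a sketch; unit and config names as in a standard systemd setup):

journalctl --disk-usage                      # total size currently used by the journal
journalctl --list-boots | head -n 3          # oldest boots still covered by the journal
grep -E 'SystemMaxUse|SystemKeepFree|MaxRetentionSec' /etc/systemd/journald.conf
journalctl -u openqa-webui --since 2023-03-20 --until 2023-03-21   # what, if anything, is left for that day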

Actions #18

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

Let's handle the underlying issue as part of #127985 as it is likely the same issue anyway, and we don't have much to go on for this occurrence. We also know why the alert didn't fire, but I don't think we need to change the grace period at this point.

Actions #19

Updated by okurz about 1 year ago

  • Related to action #127052: [alert] Apache Response Time alert followed by DatasourceNoData for min_apache_response size:M added