action #129065

closed

[alert] HTTP Response alert fired, OSD loads slow size:M

Added by nicksinger 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: -
Start date: 2023-04-01
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78 alerted on 2023-05-10 15:07 CEST

Acceptance criteria

  • AC1: The alert is not firing anymore.
  • AC2: Logs have been investigated.

Suggestions

  • (Joaquin Rivera) Is OSD also slow for someone else?
  • (Fabian Vogt) That might be partially because of the yast2_nfs_server jobs for investigation. You might want to delete them now that they did their job (e.g. https://openqa.suse.de/tests/11085729; don't open it, it might crash your browser). Those jobs are special: serial_terminal has a race condition, so they hammer enter_cmd + assert_script_run in a loop until it fails.

Out of scope

  • limiting the number of test result steps uploads or handling the effect of test result steps uploading -> #129068

Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
Copied to openQA Project - action #129068: Limit the number of uploadable test result steps size:M (Resolved, mkittler, 2023-04-01)

Actions #1

Updated by nicksinger 12 months ago

  • Related to action #128789: [alert] Apache Response Time alert size:M added
Actions #2

Updated by nicksinger 12 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 12 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 12 months ago

  • Copied to action #129068: Limit the number of uploadable test result steps size:M added
Actions #5

Updated by mkittler 11 months ago

  • Subject changed from [alert] HTTP Response alert fired, OSD loads slow to [alert] HTTP Response alert fired, OSD loads slow size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Urgent to High
Actions #6

Updated by mkittler 11 months ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler 11 months ago

  • Status changed from Workable to Feedback

There's nothing interesting in the logs during the spikes of high response time (and shortly after/before).

I've also been looking at the apache2 log. At the time of the 1st and 2nd spikes there are many thumbnail requests. However, I don't think that is enough to explain the problem. They're about jobs 11075965, 11063307 and a few others, so it's actually not just the job mentioned in the ticket description. I dared to open a few of those jobs; they're actually just small ones. The problematic job 11085729 doesn't actually show up often in the logs, maybe because it gets stuck loading even before all the individual thumbnails are loaded.
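
For reference, one rough way to gauge how many thumbnail requests a fully rendered details page can trigger is to count the screenshots referenced per job. This is only a sketch assuming direct psql access; the screenshot_links layout (screenshot_id, job_id) matches the COPY statement quoted further down:

    -- Count screenshots (and thus potential thumbnail requests) per job.
    SELECT job_id, count(*) AS screenshot_count
    FROM screenshot_links
    WHERE job_id IN (11085729, 11075965, 11063307)
    GROUP BY job_id
    ORDER BY screenshot_count DESC;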

Slightly before the 2nd spike, many requests were made by the qem-bot. Again, I don't think that alone would have caused this, especially because most of those requests happened slightly before the spike.

I've also been checking the PostgreSQL logs and spotted the following problematic queries:

2023-05-10 13:30:03.855 CEST openqa geekotest [22953]ERROR:  syntax error at or near "where" at character 57
2023-05-10 13:30:03.855 CEST openqa geekotest [22953]STATEMENT:  select * from job_settings where key like '%SCC_ADDONS' where t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:30:53.799 CEST openqa geekotest [22953]LOG:  duration: 17051.838 ms  statement: select * from job_settings where key like '%SCC_ADDONS' and key != 'SCC_ADDONS' and key != '_XSCC_ADDONS' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:35:28.503 CEST openqa geekotest [22953]LOG:  duration: 16824.830 ms  statement: select * from job_settings where key like '%INCIDENT_REPO' and key != 'INCIDENT_REPO' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 14:00:29.355 CEST openqa openqa [21528]LOG:  duration: 22176.099 ms  execute <unnamed>/C_64: -- Metabase
        SELECT "public"."job_modules"."result" AS "result" FROM "public"."job_modules" GROUP BY "public"."job_modules"."result" ORDER BY "public"."job_modules"."result" ASC LIMIT 1000
2023-05-10 14:00:48.447 CEST openqa openqa [21528]LOG:  duration: 19054.975 ms  execute <unnamed>/C_65: -- Metabase 
        SELECT "public"."job_modules"."important" AS "important" FROM "public"."job_modules" GROUP BY "public"."job_modules"."important" ORDER BY "public"."job_modules"."important" ASC LIMIT 1000
2023-05-10 14:01:02.946 CEST openqa openqa [21528]LOG:  duration: 14473.963 ms  execute <unnamed>/C_66: -- Metabase
        SELECT "public"."job_modules"."fatal" AS "fatal" FROM "public"."job_modules" GROUP BY "public"."job_modules"."fatal" ORDER BY "public"."job_modules"."fatal" ASC LIMIT 1000
2023-05-10 14:01:20.319 CEST openqa openqa [21528]LOG:  duration: 17336.367 ms  execute <unnamed>/C_67: -- Metabase 
        SELECT "public"."job_modules"."always_rollback" AS "always_rollback" FROM "public"."job_modules" GROUP BY "public"."job_modules"."always_rollback" ORDER BY "public"."job_modules"."always_rollback" ASC LIMIT 1000
2023-05-10 14:01:34.030 CEST openqa openqa [21528]LOG:  duration: 13671.070 ms  execute <unnamed>/C_68: -- Metabase
        SELECT "public"."job_modules"."milestone" AS "milestone" FROM "public"."job_modules" GROUP BY "public"."job_modules"."milestone" ORDER BY "public"."job_modules"."milestone" ASC LIMIT 1000

However, that's before the relevant timeframe and thus not related.
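
As a side note on why those ad-hoc job_settings queries are so slow (my own reading, not something from the logs): the leading wildcard in key LIKE '%SCC_ADDONS' prevents PostgreSQL from using a plain btree index on job_settings.key even if one exists, so the statement scans the whole table. A minimal sketch to confirm that, assuming psql access to the OSD database (note that EXPLAIN ANALYZE actually executes the statement, so it runs as long as the original):

    -- Expect a sequential scan over job_settings due to the leading-wildcard LIKE.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM job_settings
    WHERE key LIKE '%SCC_ADDONS'
      AND key != 'SCC_ADDONS'
      AND key != '_XSCC_ADDONS'
      AND t_created >= '2023-05-01 00:00:00'
    LIMIT 10;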

It looks like someone was hammering the database at some point via COPY statements:

2023-05-10 15:32:09.546 CEST openqa geekotest [32020]LOG:  duration: 13359.954 ms  execute <unnamed>: SELECT me.id, me.script, me.job_id, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.TEST, job.id, job.result_dir, job.archived, job.state, job.priority, job.result, job.reason, job.clone_id, job.blocked_by_id, job.TEST, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.MACHINE, job.group_id, job.assigned_worker_id, job.t_started, job.t_finished, job.logs_present, job.passed_module_count, job.failed_module_count, job.softfailed_module_count, job.skipped_module_count, job.externally_skipped_module_count, job.scheduled_product_id, job.result_size, job.t_created, job.t_updated FROM job_modules me LEFT JOIN jobs job ON job.id = me.job_id WHERE ( name LIKE $1 ) GROUP BY me.id, job.DISTRI, job.VERSION, job.FLAVOR, job.TEST, job.ARCH, job.MACHINE, job.id ORDER BY job_id DESC LIMIT $2
2023-05-10 15:32:09.546 CEST openqa geekotest [32020]DETAIL:  parameters: $1 = '%worker13%', $2 = '25001'
2023-05-10 16:05:24.710 CEST openqa postgres [27098]LOG:  duration: 57564.296 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-05-10 16:08:54.955 CEST openqa postgres [27098]LOG:  duration: 204901.496 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-05-10 16:12:21.055 CEST openqa postgres [27098]LOG:  duration: 206087.081 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-05-10 16:12:38.912 CEST openqa postgres [27098]LOG:  duration: 17102.952 ms  statement: COPY public.jobs (id, priority, clone_id, t_started, t_finished, t_created, t_updated, state, result, result_dir, group_id, test, distri, version, flavor, arch, build, machine, logs_present, assigned_worker_id, passed_module_count, failed_module_count, softfailed_module_count, skipped_module_count, blocked_by_id, externally_skipped_module_count, scheduled_product_id, reason, result_size, archived) TO stdout;
2023-05-10 16:12:52.960 CEST openqa postgres [27098]LOG:  duration: 14046.894 ms  statement: COPY public.jobs_assets (job_id, asset_id, t_created, t_updated, created_by) TO stdout;
2023-05-10 16:14:10.465 CEST openqa postgres [27098]LOG:  duration: 74644.911 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-05-10 16:20:16.253 CEST openqa postgres [27098]LOG:  duration: 365763.419 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-05-10 16:26:52.185 CEST openqa postgres [27098]LOG:  duration: 395860.647 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;

The longest of those queries took about 6.6 minutes, but even accounting for that duration they still don't fall into the relevant timeframe. I'm still wondering who's running those expensive queries. Maybe they're just part of our backup?
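
The COPY ... TO stdout pattern is what pg_dump emits, which would fit the backup theory. A minimal sketch for attributing such statements while they run, assuming psql access with sufficient privileges (pg_monitor or superuser) on OSD:

    -- Check which client/session issues the long-running COPY statements;
    -- pg_dump normally reports application_name = 'pg_dump'.
    SELECT pid, usename, application_name, client_addr, backend_start, query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query ILIKE 'COPY%TO stdout%';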

I've also been looking at the full journal. Considering all the velociraptor messages, e.g.

May 10 15:00:11 openqa velociraptor-client[12948]: [INFO] 2023-05-10T15:00:11+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":260,\"WritePointer\":2990,\"MaxSize\":1073741874,\"AvailableBytes\":2626,\"LeasedBytes\":0}","leased_pointer":260}

it is hard to spot what's going on. I think the raptor is just very verbose and not necessarily the culprit (considering this highly verbose log message seems to be logged all the time). There's definitely room for improvement, though.

Actions #8

Updated by tinita 11 months ago

> syntax error at or near "where" at character 57

That was me.

> It looks like someone was hammering the database at some point via COPY statements:
> ...
> The longest of those queries took about 6.6 minutes, but even accounting for that duration they still don't fall into the relevant timeframe. I'm still wondering who's running those expensive queries. Maybe they're just part of our backup?

I mentioned those before (e.g. today when talking about the rows panel in Grafana). They are not new and happen quite regularly.

Actions #9

Updated by mkittler 11 months ago

  • Status changed from Feedback to Resolved

Good. So I'm resolving the issue.
