action #129065 (closed)
[alert] HTTP Response alert fired, OSD loads slow size:M
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78 alerted on 2023-05-10 15:07 CEST
Acceptance criteria
- AC1: The alert is not firing anymore.
- AC2: Logs have been investigated.
Suggestions
- Look into the timeframe https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683723624920&to=1683724305517 and compare it with other panels on OSD to see whether it's visible what made the system busy. DONE: nothing too unusual, maybe slightly high I/O times, but far from concerning.
- @okurz suggested in https://suse.slack.com/archives/C02AJ1E568M/p1683724668733689?thread_ts=1683724103.321589&cid=C02AJ1E568M that it might be caused by something we don't collect metrics from - brainstorm what these could be and implement metrics for them
- Open network connections - nsinger observed peaks of >2k, ~75% of them related to httpd-prefork and ~20% to openqa-websocket (see the metric sketch after this list)
- > (Nick Singer) I'm currently logged into OSD. CPU utilization is quite high, with a long-term load of 12 and a short-term load of ~14 on only 12 cores. velociraptor goes up to 200% and is generally quite high in the process list, but so are telegraf and obviously openQA itself.
  > (Oliver Kurz) All of that sounds fine. When the HTTP response time was high I just took a look and the CPU usage was near 0, same as we suspected in the past. Remember our debugging of why qanet is slow? This is comparable, but here it's likely Apache and the number of concurrent connections, something like that.
- Take https://suse.slack.com/archives/C02CANHLANP/p1683723956965209 into account - is there something we can do to improve this situation?
  > (Joaquin Rivera) Is OSD also slow for someone else?
  > (Fabian Vogt) That might be partially because of the yast2_nfs_server jobs for investigation. You might want to delete them now that they did their job (e.g. https://openqa.suse.de/tests/11085729 - don't open it, it might crash your browser). Those jobs are special: serial_terminal has some race condition, so they hammer enter_cmd + assert_script_run in a loop until it fails.
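A minimal sketch of how such a connection metric could be collected, for example via a telegraf `exec` input with `data_format = "influx"`. The script below is hypothetical - the measurement name, the psutil dependency and the tagging are assumptions, not existing OSD tooling:

```python
#!/usr/bin/env python3
# Hypothetical helper for a telegraf "exec" input: count established TCP
# connections per owning process and print InfluxDB line protocol.
# Assumes python3-psutil is installed; run as root to resolve all PIDs.
from collections import Counter

import psutil


def main():
    counts = Counter()
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status != psutil.CONN_ESTABLISHED:
            continue
        try:
            name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            name = "unknown"
        counts[name] += 1
    for name, count in counts.items():
        # e.g. open_connections,process=httpd-prefork count=1543i
        print(f"open_connections,process={name} count={count}i")


if __name__ == "__main__":
    main()
```

That would make peaks like the >2k connections nsinger observed visible per process (httpd-prefork vs. openqa-websocket) on the existing dashboards.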
Out of scope
- limiting the number of test result step uploads or handling the effect of test result step uploads -> #129068
Updated by nicksinger over 1 year ago
- Related to action #128789: [alert] Apache Response Time alert size:M added
Updated by okurz over 1 year ago
- Copied to action #129068: Limit the number of uploadable test result steps size:M added
Updated by mkittler over 1 year ago
- Subject changed from [alert] HTTP Response alert fired, OSD loads slow to [alert] HTTP Response alert fired, OSD loads slow size:M
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Urgent to High
Updated by mkittler over 1 year ago
- Status changed from Workable to Feedback
There's nothing interesting in the logs during the spikes of high response time (and shortly after/before).
I've also been looking at the apache2 log. At the time of the 1st and 2nd spikes there are many thumbnail requests. However, I don't think that is enough to explain the problem. They're about jobs 11075965, 11063307 and a few others, so it's actually not just the job mentioned in the ticket description. I dared to open a few of those jobs; they're actually just small ones. The problematic job 11085729 doesn't actually show up often in the logs, maybe because it gets stuck loading even before all the individual thumbnails are loaded.
Slightly before the 2nd spike many requests were made by the qem-bot. Again, I don't think that alone would have caused the problem - especially because most of those requests happened slightly before the spike.
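For reference, a rough sketch of how the per-job thumbnail counting could be automated from the Apache access log; the log path and the `/tests/<id>/images/thumb/` URL pattern are assumptions and may need adjusting to what the log actually contains:

```python
#!/usr/bin/env python3
# Rough sketch: count thumbnail requests per job id in an Apache access log.
# Log path and URL pattern are assumptions; adjust them to the real setup.
import re
from collections import Counter

LOG_FILE = "/var/log/apache2/access_log"  # hypothetical path
THUMB_RE = re.compile(r'"GET /tests/(\d+)/images/thumb/')


def main():
    counts = Counter()
    with open(LOG_FILE, errors="replace") as log:
        for line in log:
            match = THUMB_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    for job_id, count in counts.most_common(20):
        print(f"job {job_id}: {count} thumbnail requests")


if __name__ == "__main__":
    main()
```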
I've also been checking the PostgreSQL logs and spotted the following problematic queries:
2023-05-10 13:30:03.855 CEST openqa geekotest [22953]ERROR: syntax error at or near "where" at character 57
2023-05-10 13:30:03.855 CEST openqa geekotest [22953]STATEMENT: select * from job_settings where key like '%SCC_ADDONS' where t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:30:53.799 CEST openqa geekotest [22953]LOG: duration: 17051.838 ms statement: select * from job_settings where key like '%SCC_ADDONS' and key != 'SCC_ADDONS' and key != '_XSCC_ADDONS' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:35:28.503 CEST openqa geekotest [22953]LOG: duration: 16824.830 ms statement: select * from job_settings where key like '%INCIDENT_REPO' and key != 'INCIDENT_REPO' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 14:00:29.355 CEST openqa openqa [21528]LOG: duration: 22176.099 ms execute <unnamed>/C_64: -- Metabase
SELECT "public"."job_modules"."result" AS "result" FROM "public"."job_modules" GROUP BY "public"."job_modules"."result" ORDER BY "public"."job_modules"."result" ASC LIMIT 1000
2023-05-10 14:00:48.447 CEST openqa openqa [21528]LOG: duration: 19054.975 ms execute <unnamed>/C_65: -- Metabase
SELECT "public"."job_modules"."important" AS "important" FROM "public"."job_modules" GROUP BY "public"."job_modules"."important" ORDER BY "public"."job_modules"."important" ASC LIMIT 1000
2023-05-10 14:01:02.946 CEST openqa openqa [21528]LOG: duration: 14473.963 ms execute <unnamed>/C_66: -- Metabase
SELECT "public"."job_modules"."fatal" AS "fatal" FROM "public"."job_modules" GROUP BY "public"."job_modules"."fatal" ORDER BY "public"."job_modules"."fatal" ASC LIMIT 1000
2023-05-10 14:01:20.319 CEST openqa openqa [21528]LOG: duration: 17336.367 ms execute <unnamed>/C_67: -- Metabase
SELECT "public"."job_modules"."always_rollback" AS "always_rollback" FROM "public"."job_modules" GROUP BY "public"."job_modules"."always_rollback" ORDER BY "public"."job_modules"."always_rollback" ASC LIMIT 1000
2023-05-10 14:01:34.030 CEST openqa openqa [21528]LOG: duration: 13671.070 ms execute <unnamed>/C_68: -- Metabase
SELECT "public"."job_modules"."milestone" AS "milestone" FROM "public"."job_modules" GROUP BY "public"."job_modules"."milestone" ORDER BY "public"."job_modules"."milestone" ASC LIMIT 1000
However, that's before the relevant timeframe and thus not related.
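For context on why those particular statements are so slow: a LIKE pattern with a leading wildcard such as '%SCC_ADDONS' cannot use a plain btree index on `key`, so PostgreSQL has to scan the very large job_settings table, which fits the ~17 s durations. A sketch of how one could confirm that via the query plan; connection parameters are placeholders, and EXPLAIN ANALYZE actually executes the query, so it takes as long as the original statement:

```python
#!/usr/bin/env python3
# Sketch: look at the query plan of one of the slow leading-wildcard LIKE
# queries. Connection parameters are placeholders, not the real OSD setup.
import psycopg2

EXPLAIN = """
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM job_settings
 WHERE key LIKE '%SCC_ADDONS' AND key != 'SCC_ADDONS'
   AND key != '_XSCC_ADDONS' AND t_created >= '2023-05-01 00:00:00'
 LIMIT 10
"""

with psycopg2.connect(dbname="openqa", user="geekotest") as conn:
    with conn.cursor() as cur:
        cur.execute(EXPLAIN)
        for (plan_line,) in cur.fetchall():  # one plan line per row
            print(plan_line)
```

If such queries ever needed to be fast, a pg_trgm GIN index on job_settings.key could help, but for ad-hoc exploration that is probably not worth it.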
It looks like someone was hammering the database at some point via COPY statements:
2023-05-10 15:32:09.546 CEST openqa geekotest [32020]LOG: duration: 13359.954 ms execute <unnamed>: SELECT me.id, me.script, me.job_id, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.TEST, job.id, job.result_dir, job.archived, job.state, job.priority, job.result, job.reason, job.clone_id, job.blocked_by_id, job.TEST, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.MACHINE, job.group_id, job.assigned_worker_id, job.t_started, job.t_finished, job.logs_present, job.passed_module_count, job.failed_module_count, job.softfailed_module_count, job.skipped_module_count, job.externally_skipped_module_count, job.scheduled_product_id, job.result_size, job.t_created, job.t_updated FROM job_modules me LEFT JOIN jobs job ON job.id = me.job_id WHERE ( name LIKE $1 ) GROUP BY me.id, job.DISTRI, job.VERSION, job.FLAVOR, job.TEST, job.ARCH, job.MACHINE, job.id ORDER BY job_id DESC LIMIT $2
2023-05-10 15:32:09.546 CEST openqa geekotest [32020]DETAIL: parameters: $1 = '%worker13%', $2 = '25001'
2023-05-10 16:05:24.710 CEST openqa postgres [27098]LOG: duration: 57564.296 ms statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-05-10 16:08:54.955 CEST openqa postgres [27098]LOG: duration: 204901.496 ms statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-05-10 16:12:21.055 CEST openqa postgres [27098]LOG: duration: 206087.081 ms statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-05-10 16:12:38.912 CEST openqa postgres [27098]LOG: duration: 17102.952 ms statement: COPY public.jobs (id, priority, clone_id, t_started, t_finished, t_created, t_updated, state, result, result_dir, group_id, test, distri, version, flavor, arch, build, machine, logs_present, assigned_worker_id, passed_module_count, failed_module_count, softfailed_module_count, skipped_module_count, blocked_by_id, externally_skipped_module_count, scheduled_product_id, reason, result_size, archived) TO stdout;
2023-05-10 16:12:52.960 CEST openqa postgres [27098]LOG: duration: 14046.894 ms statement: COPY public.jobs_assets (job_id, asset_id, t_created, t_updated, created_by) TO stdout;
2023-05-10 16:14:10.465 CEST openqa postgres [27098]LOG: duration: 74644.911 ms statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-05-10 16:20:16.253 CEST openqa postgres [27098]LOG: duration: 365763.419 ms statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-05-10 16:26:52.185 CEST openqa postgres [27098]LOG: duration: 395860.647 ms statement: COPY public.screenshots (id, filename, t_created) TO stdout;
The longest of those queries took about 6.6 minutes, which is of course still not long enough for it to reach back into the relevant timeframe. I'm still wondering who's doing those expensive queries. Maybe they're just part of our backup?
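One way to attribute such statements while they are running would be to poll pg_stat_activity, which exposes the user, client address and application name per backend; a pg_dump-based backup would typically show up with application_name = 'pg_dump'. A minimal sketch - connection parameters and polling interval are placeholders:

```python
#!/usr/bin/env python3
# Sketch: poll pg_stat_activity to see which client/application is running
# the long COPY statements (e.g. whether they come from the backup).
# Connection parameters are placeholders; a sufficiently privileged role is needed.
import time

import psycopg2

SQL = """
SELECT pid, usename, application_name, client_addr, query_start,
       now() - query_start AS runtime, left(query, 120) AS query
  FROM pg_stat_activity
 WHERE state = 'active' AND query ILIKE 'copy %'
"""

conn = psycopg2.connect(dbname="openqa", user="postgres")
conn.autocommit = True  # each poll should see fresh activity data
cur = conn.cursor()
while True:
    cur.execute(SQL)
    for row in cur.fetchall():
        print(row)
    time.sleep(30)
```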
I've also been looking at the full journal. Considering all the velociraptor messages, e.g.
May 10 15:00:11 openqa velociraptor-client[12948]: [INFO] 2023-05-10T15:00:11+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":260,\"WritePointer\":2990,\"MaxSize\":1073741874,\"AvailableBytes\":2626,\"LeasedBytes\":0}","leased_pointer":260}
it is hard to spot what's going on. I think velociraptor is just very verbose and not necessarily the culprit (this highly verbose log message seems to be logged all the time). There's definitely room for improvement, though.
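To make the journal easier to read despite the velociraptor noise, one could filter it out when investigating, e.g. by post-processing journalctl's JSON output. The timeframe below is just the window around the alert and the filtered identifier is an assumption:

```python
#!/usr/bin/env python3
# Sketch: dump the journal around the alert without the velociraptor noise.
# Timeframe and the filtered syslog identifier are assumptions.
import json
import subprocess

CMD = [
    "journalctl", "--no-pager", "-o", "json",
    "--since", "2023-05-10 14:50", "--until", "2023-05-10 15:20",
]

output = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
for line in output.splitlines():
    entry = json.loads(line)
    if "velociraptor" in entry.get("SYSLOG_IDENTIFIER", ""):
        continue
    # MESSAGE can occasionally be a byte array; only print plain strings here
    message = entry.get("MESSAGE")
    if isinstance(message, str):
        print(entry.get("SYSLOG_IDENTIFIER", "?"), message)
```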
Updated by tinita over 1 year ago
> syntax error at or near "where" at character 57

That was me.

> It looks like someone was hammering the database at some point via COPY statements:
> ...
> The longest of those queries took about 6.6 minutes, which is of course still not long enough for it to reach back into the relevant timeframe. I'm still wondering who's doing those expensive queries. Maybe they're just part of our backup?

I mentioned those before (e.g. today when talking about the rows panel in Grafana). They are not new and happen quite regularly.
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
Good. So I'm resolving the issue.