action #129065

closed

[alert] HTTP Response alert fired, OSD loads slow size:M

Added by nicksinger 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: -
Start date: 2023-04-01
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78 alerted on 2023-05-10 15:07 CEST

Acceptance criteria

  • AC1: The alert is not firing anymore.
  • AC2: Logs have been investigated.

Suggestions

  • (Joaquin Rivera) Is OSD also slow for someone else?
  • (Fabian Vogt) That might be partially because of the yast2_nfs_server jobs for investigation. You might want to delete them now that they did their job (e.g. https://openqa.suse.de/tests/11085729; don't open it, it might crash your browser). Those jobs are special: serial_terminal has a race condition, so they hammer enter_cmd + assert_script_run in a loop until it fails.

Out of scope

  • limiting the number of test result steps uploads or handling the effect of test result steps uploading -> #129068

Related issues: 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #128789: [alert] Apache Response Time alert size:M (Resolved, nicksinger, 2023-04-01)
Copied to openQA Project - action #129068: Limit the number of uploadable test result steps size:M (Resolved, mkittler, 2023-04-01)

Actions #1

Updated by nicksinger 12 months ago

  • Related to action #128789: [alert] Apache Response Time alert size:M added
Actions #2

Updated by nicksinger 12 months ago

  • Description updated (diff)
Actions #3

Updated by okurz 12 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 12 months ago

  • Copied to action #129068: Limit the number of uploadable test result steps size:M added
Actions #5

Updated by mkittler 11 months ago

  • Subject changed from [alert] HTTP Response alert fired, OSD loads slow to [alert] HTTP Response alert fired, OSD loads slow size:M
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Urgent to High
Actions #6

Updated by mkittler 11 months ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler 11 months ago

  • Status changed from Workable to Feedback

There's nothing interesting in the logs during the spikes of high response time (and shortly after/before).

I've also been looking at the apache2 log. At the time of the 1st and 2nd spikes there are many thumbnail requests. However, I don't think that is enough to explain the problem. They're about jobs 11075965, 11063307 and a few others, so it's actually not just the job mentioned in the ticket description. I dared to open a few of those jobs; they're actually just small ones. The problematic job 11085729 doesn't actually show up often in the logs, maybe because it gets stuck loading even before all the individual thumbnails are loaded.
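
For reference, one rough way to gauge how many thumbnail requests a fully rendered details page can trigger is to count the screenshots referenced per job. This is only a sketch assuming direct psql access; the screenshot_links layout (screenshot_id, job_id) matches the COPY statement quoted further down:

    -- Count screenshots (and thus potential thumbnail requests) per job.
    SELECT job_id, count(*) AS screenshot_count
    FROM screenshot_links
    WHERE job_id IN (11085729, 11075965, 11063307)
    GROUP BY job_id
    ORDER BY screenshot_count DESC;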

Slightly before the 2nd spike, many requests were made by the qem-bot. Again, I don't think that alone would have caused this, especially because most of those requests happened slightly before the spike.

I've also been checking the PostgreSQL logs and spotted the following problematic queries:

2023-05-10 13:30:03.855 CEST openqa geekotest [22953]ERROR:  syntax error at or near "where" at character 57
2023-05-10 13:30:03.855 CEST openqa geekotest [22953]STATEMENT:  select * from job_settings where key like '%SCC_ADDONS' where t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:30:53.799 CEST openqa geekotest [22953]LOG:  duration: 17051.838 ms  statement: select * from job_settings where key like '%SCC_ADDONS' and key != 'SCC_ADDONS' and key != '_XSCC_ADDONS' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 13:35:28.503 CEST openqa geekotest [22953]LOG:  duration: 16824.830 ms  statement: select * from job_settings where key like '%INCIDENT_REPO' and key != 'INCIDENT_REPO' and t_created >= '2023-05-01 00:00:00' limit 10;
2023-05-10 14:00:29.355 CEST openqa openqa [21528]LOG:  duration: 22176.099 ms  execute <unnamed>/C_64: -- Metabase
        SELECT "public"."job_modules"."result" AS "result" FROM "public"."job_modules" GROUP BY "public"."job_modules"."result" ORDER BY "public"."job_modules"."result" ASC LIMIT 1000
2023-05-10 14:00:48.447 CEST openqa openqa [21528]LOG:  duration: 19054.975 ms  execute <unnamed>/C_65: -- Metabase 
        SELECT "public"."job_modules"."important" AS "important" FROM "public"."job_modules" GROUP BY "public"."job_modules"."important" ORDER BY "public"."job_modules"."important" ASC LIMIT 1000
2023-05-10 14:01:02.946 CEST openqa openqa [21528]LOG:  duration: 14473.963 ms  execute <unnamed>/C_66: -- Metabase
        SELECT "public"."job_modules"."fatal" AS "fatal" FROM "public"."job_modules" GROUP BY "public"."job_modules"."fatal" ORDER BY "public"."job_modules"."fatal" ASC LIMIT 1000
2023-05-10 14:01:20.319 CEST openqa openqa [21528]LOG:  duration: 17336.367 ms  execute <unnamed>/C_67: -- Metabase 
        SELECT "public"."job_modules"."always_rollback" AS "always_rollback" FROM "public"."job_modules" GROUP BY "public"."job_modules"."always_rollback" ORDER BY "public"."job_modules"."always_rollback" ASC LIMIT 1000
2023-05-10 14:01:34.030 CEST openqa openqa [21528]LOG:  duration: 13671.070 ms  execute <unnamed>/C_68: -- Metabase
        SELECT "public"."job_modules"."milestone" AS "milestone" FROM "public"."job_modules" GROUP BY "public"."job_modules"."milestone" ORDER BY "public"."job_modules"."milestone" ASC LIMIT 1000

However, that's before the relevant timeframe and thus not related.
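
As a side note on why those ad-hoc job_settings queries are so slow (my own reading, not something from the logs): the leading wildcard in key LIKE '%SCC_ADDONS' prevents PostgreSQL from using a plain btree index on job_settings.key even if one exists, so the statement scans the whole table. A minimal sketch to confirm that, assuming psql access to the OSD database (note that EXPLAIN ANALYZE actually executes the statement, so it runs as long as the original):

    -- Expect a sequential scan over job_settings due to the leading-wildcard LIKE.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM job_settings
    WHERE key LIKE '%SCC_ADDONS'
      AND key != 'SCC_ADDONS'
      AND key != '_XSCC_ADDONS'
      AND t_created >= '2023-05-01 00:00:00'
    LIMIT 10;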

It looks like someone was hammering the database at some point via COPY statements:

2023-05-10 15:32:09.546 CEST openqa geekotest [32020]LOG:  duration: 13359.954 ms  execute <unnamed>: SELECT me.id, me.script, me.job_id, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.TEST, job.id, job.result_dir, job.archived, job.state, job.priority, job.result, job.reason, job.clone_id, job.blocked_by_id, job.TEST, job.DISTRI, job.VERSION, job.FLAVOR, job.ARCH, job.BUILD, job.MACHINE, job.group_id, job.assigned_worker_id, job.t_started, job.t_finished, job.logs_present, job.passed_module_count, job.failed_module_count, job.softfailed_module_count, job.skipped_module_count, job.externally_skipped_module_count, job.scheduled_product_id, job.result_size, job.t_created, job.t_updated FROM job_modules me LEFT JOIN jobs job ON job.id = me.job_id WHERE ( name LIKE $1 ) GROUP BY me.id, job.DISTRI, job.VERSION, job.FLAVOR, job.TEST, job.ARCH, job.MACHINE, job.id ORDER BY job_id DESC LIMIT $2
2023-05-10 15:32:09.546 CEST openqa geekotest [32020]DETAIL:  parameters: $1 = '%worker13%', $2 = '25001'
2023-05-10 16:05:24.710 CEST openqa postgres [27098]LOG:  duration: 57564.296 ms  statement: COPY public.audit_events (id, user_id, connection_id, event, event_data, t_created, t_updated) TO stdout;
2023-05-10 16:08:54.955 CEST openqa postgres [27098]LOG:  duration: 204901.496 ms  statement: COPY public.job_modules (id, job_id, name, script, category, t_created, t_updated, result, milestone, important, fatal, always_rollback) TO stdout;
2023-05-10 16:12:21.055 CEST openqa postgres [27098]LOG:  duration: 206087.081 ms  statement: COPY public.job_settings (id, key, value, job_id, t_created, t_updated) TO stdout;
2023-05-10 16:12:38.912 CEST openqa postgres [27098]LOG:  duration: 17102.952 ms  statement: COPY public.jobs (id, priority, clone_id, t_started, t_finished, t_created, t_updated, state, result, result_dir, group_id, test, distri, version, flavor, arch, build, machine, logs_present, assigned_worker_id, passed_module_count, failed_module_count, softfailed_module_count, skipped_module_count, blocked_by_id, externally_skipped_module_count, scheduled_product_id, reason, result_size, archived) TO stdout;
2023-05-10 16:12:52.960 CEST openqa postgres [27098]LOG:  duration: 14046.894 ms  statement: COPY public.jobs_assets (job_id, asset_id, t_created, t_updated, created_by) TO stdout;
2023-05-10 16:14:10.465 CEST openqa postgres [27098]LOG:  duration: 74644.911 ms  statement: COPY public.scheduled_products (id, distri, version, flavor, arch, build, iso, status, settings, results, user_id, gru_task_id, minion_job_id, t_created, t_updated) TO stdout;
2023-05-10 16:20:16.253 CEST openqa postgres [27098]LOG:  duration: 365763.419 ms  statement: COPY public.screenshot_links (screenshot_id, job_id) TO stdout;
2023-05-10 16:26:52.185 CEST openqa postgres [27098]LOG:  duration: 395860.647 ms  statement: COPY public.screenshots (id, filename, t_created) TO stdout;

The longest of those queries took about 6.6 minutes, but even accounting for that duration they still don't fall into the relevant timeframe. I'm still wondering who's running those expensive queries. Maybe they're just part of our backup?
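
The COPY ... TO stdout pattern is what pg_dump emits, which would fit the backup theory. A minimal sketch for attributing such statements while they run, assuming psql access with sufficient privileges (pg_monitor or superuser) on OSD:

    -- Check which client/session issues the long-running COPY statements;
    -- pg_dump normally reports application_name = 'pg_dump'.
    SELECT pid, usename, application_name, client_addr, backend_start, query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query ILIKE 'COPY%TO stdout%';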

I've also been looking at the full journal. Considering all the velociraptor messages, e.g.

May 10 15:00:11 openqa velociraptor-client[12948]: [INFO] 2023-05-10T15:00:11+02:00 File Ring Buffer: Enqueue {"header":"{\"ReadPointer\":260,\"WritePointer\":2990,\"MaxSize\":1073741874,\"AvailableBytes\":2626,\"LeasedBytes\":0}","leased_pointer":260}

it is hard to spot what's going on. I think the raptor is just very verbose and not necessarily the culprit (considering this highly verbose log message seems to be logged all the time). There's definitely room for improvement, though.

Actions #8

Updated by tinita 11 months ago

> syntax error at or near "where" at character 57

That was me.

> It looks like someone was hammering the database at some point via COPY statements:
> ...
> The longest of those queries took about 6.6 minutes, but even accounting for that duration they still don't fall into the relevant timeframe. I'm still wondering who's running those expensive queries. Maybe they're just part of our backup?

I mentioned those before (e.g. today when talking about the rows panel in Grafana). They are not new and happen quite regularly.

Actions #9

Updated by mkittler 11 months ago

  • Status changed from Feedback to Resolved

Good. So I'm resolving the issue.
