action #65975

Monitoring for "scheduled but not executed" (was: perl-Mojolicious-8.37 broke WS connection for (some?) workers)

Added by nicksinger over 1 year ago. Updated 10 months ago.


There is a plan to provide more metrics within SUSE, and as we already use Grafana we should be able to gather data from there.

Acceptance criteria

  • AC1: A notification on alerts is sent over the usual channels when there are scheduled jobs older than defined limits, e.g. 2 days
  • AC2: A notification on alerts is sent over the usual channels when there are more than X scheduled jobs with less than Y running jobs, e.g. X > 500 scheduled (not counting blocked), Y < 20
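The two acceptance criteria boil down to two simple predicates. As a minimal sketch (function and constant names are illustrative, not from the actual monitoring setup, which queries the database via telegraf and alerts via Grafana):

```python
from datetime import datetime, timedelta

# Illustrative thresholds taken from AC1/AC2
MAX_SCHEDULED_AGE = timedelta(days=2)
MAX_SCHEDULED_JOBS = 500   # scheduled, not counting blocked
MIN_RUNNING_JOBS = 20

def should_alert(scheduled_created_times, n_scheduled_unblocked, n_running, now=None):
    """Return (ac1, ac2): whether each acceptance criterion would fire.

    scheduled_created_times: t_created timestamps of currently scheduled jobs.
    """
    now = now or datetime.utcnow()
    # AC1: any scheduled job older than the defined limit
    ac1 = any(now - t > MAX_SCHEDULED_AGE for t in scheduled_created_times)
    # AC2: many scheduled jobs while few are actually running
    ac2 = n_scheduled_unblocked > MAX_SCHEDULED_JOBS and n_running < MIN_RUNNING_JOBS
    return ac1, ac2
```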


Time to start of tests (how long does it take at most for a test to start; purpose: so that users/teams/departments can integrate openQA tests into their own processes; data source: openQA database queries or existing data; presentation: time ranges as statistical values, e.g. mean±std, median, percentiles, as numbers as well as graphs over time)


This morning, after the deployment, I was pinged in RC that the workers don't execute jobs anymore. Looking at the logs I saw:

Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registering with openQA
Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Establishing ws connection via ws://
Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registered and connected via websockets with openQA host and worker ID 1070
Apr 22 10:59:11 openqaworker2 worker[21179]: [warn] [pid:21179] Websocket connection to finished by remote side with code 1006, no reason - trying again in 10 seconds

This repeated continuously. We "fixed" the problem by downgrading from perl-Mojolicious-8.37 to perl-Mojolicious-8.36 on OSD and restarting the openqa-websockets service.

Related issues

Has duplicate: openQA Infrastructure - action #68236: Provide metric about "time to start of tests" on OSD (Rejected, 2020-06-19)


#1 Updated by mkittler over 1 year ago

This can be reproduced by running the fullstack test locally. Before we upgrade to a fixed version of Mojolicious I can check locally whether it works again; if it works with the fixed version we can deploy it and also update the dependency within the CI.

#2 Updated by okurz over 1 year ago

  • Priority changed from Normal to High

The corresponding upgrade is in the dependency file for the Circle CI tests. There is already a failure that shows similar symptoms.

The PR was still open and is now closed unmerged, but it shows what has already been installed by automatic upgrades, as we don't have a "stable/tested" repo for all upgraded dependencies of openQA. All o3 and osd workers upgrade from devel:openQA and devel:openQA:Leap:15.1 automatically if the state in GitHub master is fine; osd upgrades if o3 is "fine" and even the openQA-in-openQA tests pass. As this problem does not manifest as a failure in these tests, we did not see it yet. This might also be because all the mentioned tests are actually not running with the upgraded version, though. Also, we can apply "Five Whys" and improve something on every level 🙂. This is something I would like to do with you in a synchronous meeting, e.g. the retrospective next week.

#3 Updated by kraih over 1 year ago

I've investigated and found two upstream bugs in Mojolicious. The first should be the cause for this WebSocket issue and will be resolved with Mojolicious 8.39.

#4 Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
  • Priority changed from High to Normal

With 8.39 I cannot reproduce the problem within the fullstack test locally anymore, so it seems to fix the most severe problem.

However, kraih mentioned in the chat:

there was another timeout bug upstream in Mojolicious, so maybe we should depend on 8.40

#5 Updated by okurz over 1 year ago

  • Assignee set to okurz

In the QA tools retro we conducted a "Five Whys" session and identified the following ideas to follow on:

  • We can suggest to kraih that he can test openQA against any pending upgrade of packages he maintains, e.g. perl-Mojolicious :)
  • Improve monitoring to look into "scheduled but not executed":

    • detect scheduled jobs older than defined limit, example: 2 days
    • if there are more than X scheduled jobs with less than Y running jobs, alert. Example selections: X > 500 scheduled (not counting blocked), Y < 20
  • We can have a nightly CI job that upgrades dependencies additionally from CPAN and runs openQA tests

#6 Updated by okurz about 1 year ago

  • Subject changed from perl-Mojolicious-8.37 broke WS connection for (some?) workers to Monitoring for "scheduled but not executed" (was: perl-Mojolicious-8.37 broke WS connection for (some?) workers)
  • Description updated (diff)

#7 Updated by okurz about 1 year ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

Adapted the ticket to capture the one idea, added the other idea in #65271#note-14

#8 Updated by okurz about 1 year ago

  • Has duplicate action #68236: Provide metric about "time to start of tests" on OSD added

#9 Updated by okurz about 1 year ago

  • Description updated (diff)

#10 Updated by okurz about 1 year ago

I recommend we start with extending our custom SQL queries in telegraf in

I tried

openqa=> select id,t_started,t_created,state,machine,arch,test from jobs where state != 'done' and state != 'cancelled' order by t_created limit 30;
   id    | t_started |      t_created      |   state   |         machine         |  arch   |                             test                              
 4296324 |           | 2020-05-29 14:46:21 | scheduled | ipmi-tyrion             | x86_64  | install_ltp_baremetal_mlx
 4327400 |           | 2020-06-08 15:02:57 | scheduled | ipmi-tyrion             | x86_64  | ltp_syscalls_baremetal
 4338945 |           | 2020-06-11 02:35:13 | scheduled | ppc64le-hmc-single-disk | ppc64le | gw-online_sles15sp1_pscc_basesys-srv_all_full_zypper_spvm
 4342369 |           | 2020-06-12 01:11:04 | scheduled | ipmi-tyrion             | x86_64  | ltp_syscalls_baremetal
 4366399 |           | 2020-06-18 13:06:33 | scheduled | 64bit                   | x86_64  | hpc_DELTA_slurm_slave00
 4367363 |           | 2020-06-18 17:07:48 | scheduled | 64bit                   | x86_64  | hpc_DELTA_slurm_accounting
 4367523 |           | 2020-06-18 18:08:13 | scheduled | 64bit                   | x86_64  | hpc_DELTA_slurm_accounting_supportserver

This shows a list of tests that were scheduled quite a while ago and are still waiting to be executed. The first jobs are scheduled against a machine that does not seem to have any workers with a matching WORKER_CLASS behind it, so I would handle them separately and exclude them here for now. Also some jobs are special because they have been scheduled with WORKER_CLASS=openqaworker-arm-3 but ARCH=x86_64, which does not make sense either; I think it can still be beneficial to alert about these, though. I am thinking about ignoring outliers, so e.g. checking median(NOW() - t_created) against a threshold. The PostgreSQL documentation describes how to calculate percentiles.
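The idea of using the median to ignore outliers can be sketched in a few lines; in production this would be the percentile SQL query shown in a later comment, so the function name and threshold here are only illustrative:

```python
import statistics
from datetime import timedelta

def median_age_exceeds(job_ages, threshold=timedelta(hours=4)):
    """job_ages: timedeltas (NOW() - t_created) of currently scheduled jobs.

    A single ancient, misconfigured job (e.g. unmatched WORKER_CLASS)
    does not trip the median, unlike a max-based check.
    """
    if not job_ages:
        return False
    # statistics.median works on timedeltas: they support + and / 2
    return statistics.median(job_ages) > threshold
```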

Also experimenting with calculating in grafana, e.g.

SELECT (last(scheduled)+last(blocked))/last("running") FROM "openqa_jobs" WHERE $timeFilter GROUP BY time($__interval) fill(previous)

This for example shows a high value exceeding 200 on 2020-04-25, which might link to the original event, but I am not convinced of this metric: it does not seem to show events like the one on 2020-06-16 when workers refused to execute tests.
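What that InfluxQL expression computes is a backlog-to-throughput ratio; as a plain-Python sketch (the division-by-zero handling is my assumption, the Grafana query would simply yield no value there):

```python
def backlog_ratio(scheduled, blocked, running):
    """Ratio of waiting jobs (scheduled + blocked) to currently running ones.

    High values suggest the workers cannot keep up; infinity means nothing
    is running at all.
    """
    if running == 0:
        return float('inf')
    return (scheduled + blocked) / running
```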

EDIT: Discussed with szarate as I had a call with him anyway :) He favors in particular graphs of scheduled jobs vs. worker class. A derived single-value metric would then be the "start time per worker class". However, we do not store the specific worker class in the jobs table of the database; for a start the "machine" might suffice. Overly specific worker class settings, e.g. in test suites, are likely a test configuration smell that should be avoided anyway.

#11 Updated by okurz about 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

With the merged changes and follow-ups we have a "job age" metric collected by telegraf. Maybe I messed things up a bit with my experiments; I am struggling with the influx data. The line

postgresql,arch=x86_64,db=postgres,host=openqa,machine=64bit,server=dbname\=openqa\ host\=localhost\ user\=telegraf job_age_p100=9550i,job_age_p50=9522i,job_age_p90=9548i,job_age_p99=9550i 1592978793000000000

looks as expected: the fields are "job_age_p100=…" etc. and there are tags for "arch" and "machine". But I cannot see recent tags in manual InfluxDB queries, and in Grafana I can also only see the values but not select on the tags. Maybe the problem is that I previously made the mistake of not specifying the tag values. So I am changing the tag name to "test_machine" and dropping "arch", because we can probably live without it; "machine" should be enough.
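For reference, that InfluxDB line-protocol entry decomposes into measurement, comma-separated tags, space-separated fields, and a nanosecond timestamp. A hypothetical helper to build such a line (escaping simplified to spaces, commas and equals signs; not how telegraf actually does it):

```python
def influx_line(measurement, tags, fields, timestamp_ns):
    """Build an InfluxDB line-protocol entry:
    measurement,tag1=v1,... field1=v1,... timestamp

    Integer fields get the trailing 'i' the protocol requires.
    """
    def esc(s):
        return str(s).replace(',', r'\,').replace(' ', r'\ ').replace('=', r'\=')

    tag_part = ','.join(f"{esc(k)}={esc(v)}" for k, v in sorted(tags.items()))
    field_part = ','.join(
        f"{esc(k)}={v}i" if isinstance(v, int) else f"{esc(k)}={v}"
        for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"
```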

My primary target for the "job age" is to prevent these (embarrassing) moments when QA engineers need to ping us in chat and ask "hey, have all arm workers been down for 3 days without anybody noticing?". But not only that. The individual percentiles each tell us something:

  • 100th percentile: "there are tests scheduled against a worker class but there is no such worker" -> inform the tester about the misconfiguration
  • 99th percentile: "these jobs take ages and we always need to wait for them" -> inform the tester to optimize these cases as they block the reporting
  • 90th percentile: "this is the bunch that takes longer than most of the rest" -> invest here with additional worker resources
  • 50th percentile: "this is about the time you can expect an arbitrary job to wait before it gets picked up" -> good for everyone to know what to expect
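A sketch of deriving those four percentiles from a list of job ages, using the nearest-rank method (which matches PostgreSQL's percentile_disc rather than the interpolating percentile_cont; the function name is made up):

```python
import math

def job_age_percentiles(ages):
    """Return p50/p90/p99/p100 of the given job ages.

    Nearest-rank method: the smallest value such that at least p*n
    values are <= it, like PostgreSQL's percentile_disc.
    """
    data = sorted(ages)
    n = len(data)

    def pct(p):
        return data[max(0, math.ceil(p * n) - 1)]

    return {'p50': pct(0.5), 'p90': pct(0.9), 'p99': pct(0.99), 'p100': data[-1]}
```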

The job age coming from the SQL query was already off by 2h; I needed to account for the timezone difference. This is fixed now.

Added two singlestat panels for the "best" and "worst" case along with alerts. Now the dashboard also nicely shows that the "worst" case is in red state due to misconfigured jobs:

openqa=> select arch,machine,percentile_cont(.5) within group (order by age) as p50, percentile_disc(.9) within group (order by age) as p90, percentile_disc(.99) within group (order by age) as p99, percentile_disc(1) within group (order by age) as p100 from (select id,state,machine,arch,test,(NOW() - t_created) as age from jobs where state != 'done' and state != 'cancelled' order by age) as job_ages group by machine,arch;
  arch   |         machine         |           p50           |           p90           |           p99           |          p100
 x86_64  | 64bit                   | 07:01:47.756841         | 08:01:18.756841         | 08:01:55.756841         | 08:02:19.756841
 aarch64 | aarch64                 | 04:02:00.756841         | 04:02:00.756841         | 04:02:00.756841         | 04:02:00.756841
 x86_64  | az_Standard_A2_v2       | 09:02:09.756841         | 09:02:10.756841         | 09:02:10.756841         | 09:02:10.756841
 x86_64  | ec2_t2.large            | 09:02:09.756841         | 09:02:11.756841         | 09:02:11.756841         | 09:02:11.756841
 x86_64  | gce_n1_standard_2       | 09:02:10.756841         | 09:02:11.756841         | 09:02:11.756841         | 09:02:11.756841
 x86_64  | ipmi-tyrion             | 15 days 07:05:07.756841 | 20 days 08:28:50.756841 | 20 days 08:28:50.756841 | 20 days 08:28:50.756841
 ppc64le | ppc64le-hmc-single-disk | 12 days 19:32:50.756841 | 12 days 19:32:50.756841 | 12 days 19:32:50.756841 | 12 days 19:32:50.756841
 s390x   | zkvm                    | 06:02:31.756841         | 06:02:38.756841         | 06:02:38.756841         | 06:02:38.756841
(8 rows)

So I cleaned up the old ones:

delete from jobs where state != 'done' and state != 'cancelled' and t_created < (now() - interval '10d');

Created a change to save the complete new dashboard and deploy it with salt.

#12 Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback

Apparently there are some worker classes which are really limited, triggering alerts that we most likely do not want to act on, e.g.

$ openqa-cli api --osd jobs state=scheduled | jq -r '.jobs | .[] | .id,.name'

These are cloud image jobs that have been scheduled more than 4h ago. I will bump the alert limit to 8h, but I fear the alert will lose its original purpose with this. I guess we need to find something better, and I don't think alerting per machine is easily feasible. Feedback would be appreciated.

EDIT: 2020-06-28: Apologies for the many recent alerts. I understood that this apparently overwhelms all of you and hence you missed other alerts. As a remedy I have now changed the alert notification channel for the "job age (median)" panel.

#13 Updated by okurz about 1 year ago

  • Target version set to Ready

#14 Updated by okurz about 1 year ago

  • Priority changed from Normal to Low

Every time I receive the alert it's actually caused by wrongly configured jobs. This is for the "max" alert; the other alert is still disabled. Not exactly sure how to proceed; maybe I will come up with a good idea later.

#15 Updated by okurz 12 months ago

  • Description updated (diff)

#16 Updated by okurz 12 months ago

  • Due date set to 2020-10-01

I have re-enabled the alert for "Job age (scheduled) (median)" as it has not triggered for more than a month, and I think if it did trigger it would be valid. We need to handle this again whenever it happens, but I can monitor it myself for the next two months or so.

#17 Updated by okurz 11 months ago

The alert fired again after more builds of SLE15SP3 were triggered while, due to #69727, the capacity was limited. As users also asked about the limited testing capacity, the alert was valid and told us what we needed to know anyway. After I could bring up openqaworker-arm-3 again and everything looked fine I unpaused the alert, but I will still keep myself as the recipient, as I know whom I am dealing with :)

#18 Updated by okurz 11 months ago

The alert triggered again because currently there are a lot of aarch64 HPC jobs running and scheduled but nearly no others. I don't think anything in particular is wrong here, though. Still looking for better ideas on what to do :(

#20 Updated by okurz 10 months ago

  • Due date changed from 2020-10-01 to 2020-10-07

Proposing the enablement of the median job age alert in a merge request.

#21 Updated by okurz 10 months ago

  • Status changed from Feedback to Resolved

The MR is merged and the alert is active, sending out messages. It actually triggered because someone scheduled jobs that could never run as there was no matching worker class. I saw the alert but pretended not to have seen it, and I was happy to see that mkittler took the alert seriously and addressed it :) With this I am now confident that my immediate work is done and we can close this ticket as "Resolved".
