Project

General

Profile

Actions

action #65975

closed

Monitoring for "scheduled but not executed" (was: perl-Mojolicious-8.37 broke WS connection for (some?) workers)

Added by nicksinger over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Start date:
2020-04-22
Due date:
2020-10-07
% Done:

0%

Estimated time:
Tags:

Description

Motivation

There is a plan to provide more metrics within SUSE, e.g. see https://confluence.suse.com/pages/viewpage.action?pageId=156401717#RalfUnger%E2%80%99sHome-Ideasby , and as we already use grafana we should be able to gather data from there.

Acceptance criteria

  • AC1: A notification on alerts is sent over the usual channels when there are scheduled jobs older than defined limits, e.g. 2 days
  • AC2: A notification on alerts is sent over the usual channels when there are more than X scheduled jobs with less than Y running jobs, e.g. X > 500 scheduled (not counting blocked), Y < 20

Suggestions

Time to starting of tests on openqa.suse.de (how long does it take maximum that a test on openqa.suse.de starts; purpose: How users/teams/departments can integrate openQA tests into their own processes; data source: openQA database queries or existing data on https://stats.openqa-monitor.qa.suse.de ;presentation: time ranges as statistical values, e.g. mean+-std, median, percentiles as numbers as well as graphs for progress)

Original

Today morning, after deployment, I got pinged in RC that the workers don't execute jobs anymore. Looking at the logs I saw:

Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registering with openQA openqa.suse.de
Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1070
Apr 22 10:59:05 openqaworker2 worker[21179]: [info] [pid:21179] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1070
Apr 22 10:59:11 openqaworker2 worker[21179]: [warn] [pid:21179] Websocket connection to http://openqa.suse.de/api/v1/ws/1070 finished by remote side with code 1006, no reason - trying again in 10 seconds

This was repeated all the time. We "fixed" the problem by downgrading from perl-Mojolicious-8.37 to perl-Mojolicious-8.36 on OSD and restarting the openqa-websockets service.


Related issues 1 (0 open1 closed)

Has duplicate openQA Infrastructure (public) - action #68236: Provide metric about "time to start of tests" on OSDRejectedokurz2020-06-19

Actions
Actions

Also available in: Atom PDF