action #40583

closed

Provide job stats for telegraf to poll

Added by lnussel over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-09-04
Due date:
% Done:

0%

Estimated time:

Description

I need to find out how high the load factor of openqa.opensuse.org is. I.e., can we handle Tumbleweed, Leap, and maintenance with reasonable throughput, or do we need to request the purchase of more worker hardware?

To answer that question, plotting the load of all workers, individual workers, the job queue, waiting times, etc. could help. Maybe provide data for consumption by metrics.o.o?


Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances (Resolved, nicksinger, 2018-04-25)

Actions #1

Updated by coolo over 5 years ago

Yeah, we're interested in this as well, over a somewhat longer time span and with a bit more information required.

Actions #2

Updated by coolo over 5 years ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Actions #3

Updated by jberry over 5 years ago

Likely the best route would be to expose the information via HTTP, where it can then be polled by telegraf on metrics.o.o. If standard system information like disk, CPU, and memory usage is also desired, there are a variety of pre-existing solutions for that, and the data can even end up in grafana on metrics.o.o: either telegraf running on the individual workers, or exposing the data via a munin agent or similar. I imagine your focus is the queue sizes and other metrics specific to openQA, which would need to be exposed directly.

Actions #4

Updated by lnussel over 5 years ago

For a start, a very simple graph would be the number of running and scheduled jobs. That can be polled easily with

curl 'https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled'
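
To turn that response into the two numbers for the graph, the jobs have to be counted per state on the client side. A minimal sketch in Python, assuming the response is a JSON object with a "jobs" list whose entries each carry a "state" string (that layout is an assumption here, as are the function names):

```python
import json
import urllib.request
from collections import Counter

# URL taken from the curl command above.
JOBS_URL = "https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled"


def count_jobs_by_state(payload):
    """Count jobs per state in a decoded jobs response.

    Assumption: payload is a dict with a "jobs" list and every entry has
    a "state" string. Entries without one are counted as "unknown".
    """
    return Counter(job.get("state", "unknown") for job in payload.get("jobs", []))


def fetch_job_counts(url=JOBS_URL, timeout=120):
    """Download the full job list and return the per-state counts."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_jobs_by_state(json.load(response))
```

Calling fetch_job_counts() would return a Counter such as one with "running" and "scheduled" keys. Note that this still transfers the complete job list just to count it.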

Actions #5

Updated by coolo over 5 years ago

Well, 'easily', yes. Fast, no:

coolo@openqa:~> time openqa-client jobs state=running,scheduled > /dev/null 

real    0m42.334s
user    0m7.516s
Actions #6

Updated by coolo over 5 years ago

  • Subject changed from visualize workload distribution to Provide job stats for telegraf to poll
  • Target version set to Current Sprint

We need a rather fast JSON route to report the following information:

number of running jobs
number of blocked jobs
number of non-blocked scheduled jobs

Each of these as a total, per job group, per ARCH, and per worker (host).

Actions #8

Updated by coolo over 5 years ago

  • Assignee set to coolo

https://github.com/os-autoinst/openQA/pull/1829 results in

{
   "stats" : {
      "running" : {
         "by_group" : {
            "openSUSE Leap 42.3 Updates" : 2,
            "openSUSE Krypton" : 3,
            "openSUSE Argon" : 2,
            "openSUSE Leap 15 AArch64" : 2,
            "openSUSE Leap 15.0 Updates" : 1
         },
         "total" : 10,
         "by_arch" : {
            "aarch64" : 2,
            "x86_64" : 8
         },
         "by_host" : {
            "openqaworker1" : 3,
            "openqa-aarch64" : 2,
            "openqaworker4" : 3,
            "imagetester" : 2
         }
      },
      "scheduled" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 5
         },
         "total" : 5,
         "by_arch" : {
            "aarch64" : 5
         }
      },
      "blocked" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 1
         },
         "total" : 1,
         "by_arch" : {
            "aarch64" : 1
         }
      }
   }
}
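
For anyone consuming this output outside of telegraf, the nested document can be flattened into one row per counter. A minimal sketch in Python, assuming only the layout shown above; the function name and the trimmed SAMPLE (values copied from the output above) are illustrative and not part of openQA:

```python
# Trimmed copy of the output above, used only for illustration.
SAMPLE = {
    "stats": {
        "running": {
            "total": 10,
            "by_arch": {"aarch64": 2, "x86_64": 8},
        },
        "scheduled": {
            "total": 5,
            "by_arch": {"aarch64": 5},
        },
    }
}


def flatten_stats(document):
    """Yield (state, dimension, key, count) rows from a stats document.

    A plain number such as "total" is reported with an empty key; nested
    breakdowns such as "by_arch" yield one row per entry, sorted by key.
    """
    for state, breakdown in document["stats"].items():
        for dimension, value in breakdown.items():
            if isinstance(value, dict):
                for key, count in sorted(value.items()):
                    yield (state, dimension, key, count)
            else:
                yield (state, dimension, "", value)


rows = list(flatten_stats(SAMPLE))
```

Each row maps naturally onto a measurement with the state and dimension as tags, which is roughly what a time-series database would store per poll.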
Actions #9

Updated by coolo over 5 years ago

  • Status changed from New to Resolved
Actions #11

Updated by coolo over 5 years ago

  • Target version changed from Current Sprint to Done