action #40583

Provide job stats for telegraf to poll

Added by lnussel over 1 year ago. Updated about 1 year ago.

Status: Resolved
Start date: 04/09/2018
Priority: Normal
Due date:
Assignee: coolo
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

I need to find out how high the load factor of openqa.opensuse.org is, i.e. can we handle Tumbleweed, Leap and maintenance with reasonable throughput, or do we need to request more worker hardware?

To answer that question, plotting the load of all workers, individual workers, the job queue, waiting time, etc. could help. Maybe provide data for consumption by metrics.o.o?


Related issues

Related to openQA Infrastructure - action #18164: [devops][tools] monitoring of openqa worker instances Resolved 25/04/2018

History

#1 Updated by coolo over 1 year ago

Yeah, we're interested in this as well - for a bit longer, and with a bit more info required

#2 Updated by coolo over 1 year ago

  • Related to action #18164: [devops][tools] monitoring of openqa worker instances added

#3 Updated by jberry over 1 year ago

Likely the best route would be to expose the information via HTTP, where it can then be polled via telegraf on metrics.o.o. If standard system information like disk, CPU, and memory usage is also desired, there are a variety of pre-existing solutions for that, and the data can even end up in Grafana on metrics.o.o: either telegraf running on individual workers, or exposing it via a munin agent or similar. I imagine your focus is the queue sizes and other metrics specific to openQA, which would need to be exposed directly.

#4 Updated by lnussel over 1 year ago

For a start, a very simple graph would show the number of running and scheduled jobs. That can be polled easily with:

curl 'https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled'
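
To turn that response into the two numbers for the graph, the jobs have to be counted per state on the client side. A minimal Python sketch, assuming the route returns a JSON object with a top-level "jobs" list whose entries carry a "state" field (the helper names count_by_state and poll are made up for illustration):

```python
import json
from collections import Counter
from urllib.request import urlopen

# URL taken from the curl call above.
JOBS_URL = "https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled"


def count_by_state(payload):
    """Count jobs per state in a decoded response body.

    Assumption: payload looks like {"jobs": [{"state": "running", ...}, ...]}.
    """
    return Counter(job.get("state", "unknown") for job in payload.get("jobs", []))


def poll(url=JOBS_URL, timeout=120):
    """Fetch the job list and return the per-state counts."""
    with urlopen(url, timeout=timeout) as response:
        return count_by_state(json.load(response))


# Usage (performs a network request, so it is not run here):
#     print(dict(poll()))
```

Note that this downloads the full job list just to count it, so it is only as fast as the route itself.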

#5 Updated by coolo over 1 year ago

Well, 'easily' possible, yes - fast, no:

coolo@openqa:~> time openqa-client jobs state=running,scheduled > /dev/null 

real    0m42.334s
user    0m7.516s

#6 Updated by coolo over 1 year ago

  • Subject changed from visualize workload distribution to Provide job stats for telegraf to poll
  • Target version set to Current Sprint

We need a rather fast JSON route to report the following information:

number of running jobs
number of blocked jobs
number of non-blocked scheduled jobs

each as a total, per job group, per ARCH, and per worker (host)
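
The requested breakdown can be sketched as a single pass over job records. This is only an illustration in Python, not how openQA implements it: the record fields (state, blocked, group, arch, host) and the function name summarize are assumptions, and a real route would more likely aggregate in the database to stay fast.

```python
from collections import Counter


def summarize(jobs):
    """Aggregate job records into running / scheduled / blocked buckets.

    Assumption: each job is a dict with "state" plus optional "blocked",
    "group", "arch" and "host" keys (hypothetical field names).
    """
    stats = {}
    for job in jobs:
        state = job.get("state")
        if state == "running":
            bucket = "running"
        elif state == "scheduled":
            # Blocked jobs are scheduled jobs that cannot start yet.
            bucket = "blocked" if job.get("blocked") else "scheduled"
        else:
            continue  # other states are not part of the report

        entry = stats.setdefault(bucket, {"total": 0})
        entry["total"] += 1
        for field in ("group", "arch", "host"):
            value = job.get(field)
            if value is not None:  # e.g. waiting jobs have no host yet
                entry.setdefault("by_" + field, Counter())[value] += 1
    return stats
```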

#8 Updated by coolo over 1 year ago

  • Assignee set to coolo

https://github.com/os-autoinst/openQA/pull/1829 results in

{
   "stats" : {
      "running" : {
         "by_group" : {
            "openSUSE Leap 42.3 Updates" : 2,
            "openSUSE Krypton" : 3,
            "openSUSE Argon" : 2,
            "openSUSE Leap 15 AArch64" : 2,
            "openSUSE Leap 15.0 Updates" : 1
         },
         "total" : 10,
         "by_arch" : {
            "aarch64" : 2,
            "x86_64" : 8
         },
         "by_host" : {
            "openqaworker1" : 3,
            "openqa-aarch64" : 2,
            "openqaworker4" : 3,
            "imagetester" : 2
         }
      },
      "scheduled" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 5
         },
         "total" : 5,
         "by_arch" : {
            "aarch64" : 5
         }
      },
      "blocked" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 1
         },
         "total" : 1,
         "by_arch" : {
            "aarch64" : 1
         }
      }
   }
}
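
For plotting, a nested document like this usually has to be flattened into individual series. A rough Python sketch that converts the "stats" object into InfluxDB line protocol, one line per counter; the measurement name openqa_jobs and the function names are made up for illustration, and telegraf's own JSON parsing may well make such a step unnecessary:

```python
def escape_tag(value):
    """Escape the characters that are special in line protocol tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_lines(stats, measurement="openqa_jobs"):
    """Flatten the "stats" object into line protocol strings.

    Assumption: stats maps a state name to {"total": int, "by_<dim>": {name: int}}.
    """
    lines = []
    for state, data in sorted(stats.items()):
        lines.append(f"{measurement},state={state} total={data['total']}i")
        for key, counts in sorted(data.items()):
            if not key.startswith("by_"):
                continue
            dimension = key[len("by_"):]  # "group", "arch" or "host"
            for name, count in sorted(counts.items()):
                lines.append(
                    f"{measurement},state={state},{dimension}={escape_tag(name)} count={count}i"
                )
    return lines
```

Each job group, architecture and host then becomes its own tagged series, which keeps the graphs working as groups and workers come and go.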

#9 Updated by coolo over 1 year ago

  • Status changed from New to Resolved

#11 Updated by coolo about 1 year ago

  • Target version changed from Current Sprint to Done
