action #40583
Provide job stats for telegraf to poll (closed)
Description
I need to find out how high the load factor of openqa.opensuse.org is. I.e., can we handle Tumbleweed, Leap, and maintenance with reasonable throughput, or do we need to request to buy more worker hardware?
To answer that question, plotting the load of all workers, individual workers, the job queue, waiting time, etc. could help. Maybe provide data for consumption by metrics.o.o?
Updated by coolo about 6 years ago
Yeah, we're interested in this as well - over a somewhat longer period and with a bit more information required
Updated by coolo about 6 years ago
- Related to action #18164: [devops][tools] monitoring of openqa worker instances added
Updated by jberry about 6 years ago
Likely the best route would be to expose the information via HTTP, where it can then be polled via telegraf on metrics.o.o. If standard system information like disk, CPU, and memory usage is also desired, there are a variety of pre-existing solutions for that, and the data can even end up in grafana on metrics.o.o: either telegraf running on the individual workers, or exposing the data via a munin agent or similar. I imagine your focus is the queue sizes and other metrics specific to openQA, which would need to be exposed directly.
Updated by lnussel about 6 years ago
For a start, a very simple graph would show the number of running and scheduled jobs. This can be polled easily with
curl https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled
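A minimal sketch of how such a poll could be turned into two numbers for a graph. It assumes the response is a JSON object with a top-level "jobs" list whose entries carry a "state" key; the function names count_states and fetch_counts are made up for illustration and are not part of openQA.

```python
import json
from collections import Counter
from urllib.request import urlopen

# URL taken from the curl command above.
JOBS_URL = "https://openqa.opensuse.org/api/v1/jobs?state=running,scheduled"


def count_states(payload):
    """Count jobs per state in an already decoded jobs response.

    Assumption: payload looks like {"jobs": [{"state": "running", ...}, ...]}.
    Jobs without a "state" key are counted as "unknown".
    """
    return Counter(job.get("state", "unknown") for job in payload.get("jobs", []))


def fetch_counts(url=JOBS_URL, timeout=120):
    """Download the job list and return the per-state counts."""
    with urlopen(url, timeout=timeout) as response:
        return count_states(json.load(response))
```

Calling fetch_counts() downloads the complete job list just to count its entries, so its runtime is bounded by how fast the server can produce that list.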
Updated by coolo about 6 years ago
Well, 'easily' possible - fast, no:
coolo@openqa:~> time openqa-client jobs state=running,scheduled > /dev/null
real 0m42.334s
user 0m7.516s
Updated by coolo about 6 years ago
- Subject changed from visualize workload distribution to Provide job stats for telegraf to poll
- Target version set to Current Sprint
We need a rather fast JSON route to report the following information:
number of running jobs
number of blocked jobs
number of non-blocked scheduled jobs
each as a total, per job group, per ARCH, and per worker (host)
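The requested breakdown can be sketched as a small aggregation. This only illustrates the shape of the numbers, not how openQA computes them; a fast route would more likely aggregate in the database. The job keys used here ("state", "blocked", "group", "arch", "host") and the function name job_stats are assumptions for the example.

```python
from collections import Counter


def job_stats(jobs):
    """Aggregate jobs into running / scheduled / blocked buckets.

    Each bucket holds a total plus counts per job group, per arch and
    per worker host. Jobs in any other state are ignored.
    """
    stats = {}
    for job in jobs:
        if job["state"] == "running":
            bucket = "running"
        elif job["state"] == "scheduled":
            # Hypothetical flag: a scheduled job is either blocked or not.
            bucket = "blocked" if job.get("blocked") else "scheduled"
        else:
            continue
        entry = stats.setdefault(bucket, {
            "total": 0,
            "by_group": Counter(),
            "by_arch": Counter(),
            "by_host": Counter(),
        })
        entry["total"] += 1
        entry["by_group"][job["group"]] += 1
        entry["by_arch"][job["arch"]] += 1
        # Only jobs already assigned to a worker have a host.
        if job.get("host"):
            entry["by_host"][job["host"]] += 1
    return stats
```

Keeping blocked jobs out of the scheduled bucket means the scheduled total reflects only jobs that could start as soon as a worker is free.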
Updated by coolo about 6 years ago
- Assignee set to coolo
https://github.com/os-autoinst/openQA/pull/1829 results in
{
   "stats" : {
      "running" : {
         "by_group" : {
            "openSUSE Leap 42.3 Updates" : 2,
            "openSUSE Krypton" : 3,
            "openSUSE Argon" : 2,
            "openSUSE Leap 15 AArch64" : 2,
            "openSUSE Leap 15.0 Updates" : 1
         },
         "total" : 10,
         "by_arch" : {
            "aarch64" : 2,
            "x86_64" : 8
         },
         "by_host" : {
            "openqaworker1" : 3,
            "openqa-aarch64" : 2,
            "openqaworker4" : 3,
            "imagetester" : 2
         }
      },
      "scheduled" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 5
         },
         "total" : 5,
         "by_arch" : {
            "aarch64" : 5
         }
      },
      "blocked" : {
         "by_group" : {
            "openSUSE Leap 15 AArch64" : 1
         },
         "total" : 1,
         "by_arch" : {
            "aarch64" : 1
         }
      }
   }
}
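For plotting, a nested document like the one above usually has to be turned into flat metric names. The following is a rough sketch of one way to do that on the consumer side; the dotted naming scheme and the function name flatten are choices made for this example, not something telegraf or openQA prescribes.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs.

    Leaf values are kept as they are; nested dicts contribute their
    key to the dotted metric name.
    """
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Small excerpt of the response above, used as sample input.
sample = {"stats": {"running": {"total": 10, "by_arch": {"aarch64": 2, "x86_64": 8}}}}
for metric, value in sorted(flatten(sample).items()):
    print(metric, value)
```

Group names such as "openSUSE Leap 15 AArch64" contain spaces and dots, so a real consumer would probably need to sanitize the keys before using them as metric names.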
Updated by coolo about 6 years ago
- Status changed from New to Resolved
Updated by jberry about 6 years ago
Updated by coolo about 6 years ago
- Target version changed from Current Sprint to Done