action #133700
openQA Project (public) - coordination #155485: [saga][epic] Efficient openQA worker pool resource handling in datacenters
coordination #158374: [epic] Prevention of inefficient hardware resource use
Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue, for all current top-of-rack switches (TORs) that we are connected to size:M
Description
Motivation
Often enough we run into various network-related issues. To find the available bandwidth or bottlenecks, graphs like https://mrtg.suse.de/qanet13nue/index.html can be quite helpful.
We have those available for NUE1-based switches, but I do not know of any for NUE2 or PRG2. We should research how something equivalent is possible and ensure that everybody in our team can reach the corresponding graphs.
Acceptance criteria
- AC1: Graphs like those on mrtg.suse.de are available to all current SUSE QE Tools members for common racks, e.g. FC_Basement-B1..5, PRG2-J11+PRG2-J12
- AC2: The team knows how to reach those
Suggestions
- Ask Eng-Infra where something like https://mrtg.suse.de/qanet13nue/index.html can be found for all the TOR switches mentioned in AC1
- Ask Eng-Infra if metric collection is already in place
- Figure out if we can just collect these metrics on our own
- Ensure we have ACs covered for both FC Basement and PRG2
- Check e.g. https://www.influxdata.com/integration/jti-openconfig-telemetry/
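If we end up collecting the metrics on our own, the core of an MRTG-style bandwidth graph is just the difference between two readings of an interface octet counter divided by the polling interval. The following is a minimal sketch of that calculation, not a finished collector. It assumes that the TOR switches expose the standard IF-MIB octet counters (e.g. `ifHCInOctets`/`ifHCOutOctets`) and that we can reach them at all, neither of which is confirmed in this ticket. The function names and the sample numbers are hypothetical.

```python
def octets_to_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Return the average bits per second between two octet counter readings.

    Assumption: the readings come from a monotonically increasing counter
    such as ifHCInOctets (64 bit) or ifInOctets (32 bit) that wraps to zero.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The modulo accounts for a single counter wrap between the two readings.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    return delta_octets * 8 / interval_s


def utilization_percent(bps, link_speed_bps):
    """Return the link utilization in percent for a given throughput."""
    if link_speed_bps <= 0:
        raise ValueError("link_speed_bps must be positive")
    return 100.0 * bps / link_speed_bps


if __name__ == "__main__":
    # Hypothetical readings taken 300 seconds apart on a 10 Gbit/s port.
    rate = octets_to_bps(0, 37_500_000, 300)
    print(f"{rate:.0f} bit/s, {utilization_percent(rate, 10_000_000_000):.4f}% of link")
```

The readings themselves would have to come from whatever collection path turns out to be available, for example SNMP polling or the telemetry integration linked above. The sketch only covers the arithmetic that turns raw counters into a graphable rate.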
Updated by mkittler over 1 year ago
- Subject changed from Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue/index.html, for all current top-of-rack switches (TORs) that we are connected to to Network bandwidth graphs per switch, like https://mrtg.suse.de/qanet13nue, for all current top-of-rack switches (TORs) that we are connected to size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
- Target version changed from Ready to Tools - Next
Updated by okurz over 1 year ago
- Target version changed from Tools - Next to Ready
Updated by okurz over 1 year ago
- Related to action #138698: significant increase in multi-machine test failures on OSD since 2023-10-25, e.g. test fails in support_server/setup size:M added
Updated by okurz about 1 year ago
- Assignee set to bschmidt
Please just ask in Slack channel #dct-migration https://app.slack.com/client/T02863RC2AC/C04MDKHQE20?cdn_fallback=2 with a back-reference to this ticket.
Updated by bschmidt about 1 year ago
Just adding the slack conversation here:
Jiri Novak
for switches that will be in zabbix, it would make sense to do some dashboard there. but that is not finished yet
Oliver Kurz
I am sure the switches already have something built in. Can we have access to that?
Birger Schmidt
@Jiri Novak I'm not quite sure I understand what you are saying.
Is the info already in zabbix and just the dashboard is missing, or is the data collection not done yet at all?
And if there is nothing in zabbix, is the temporary solution that Oli is pointing to possible in the meantime?
Jiri Novak
collection is not done yet
sorry, i've been busy the last few days with a huge migration, so i didn't bother to check which switches we are talking about. if it's the enginfra managed ones, they're in a separate unrouted network, so no
Oliver Kurz
@Moroni Flores can you help to ensure this is planned accordingly?
Updated by bschmidt about 1 year ago
Moroni Flores
this and all the monitoring tasks are planned for dec-feb
Oliver Kurz
That sounds great! Any card we can subscribe to or do I need to bug you again? :)
Moroni Flores
I don’t mind if you bug me again but here is the card https://jira.suse.com/browse/ENGINFRA-1893
Oliver Kurz
@Birger Schmidt
now we can set https://progress.opensuse.org/issues/133700 to "Blocked" referencing https://jira.suse.com/browse/ENGINFRA-1893
Updated by okurz about 1 year ago
- Assignee changed from bschmidt to okurz
- Target version changed from Ready to Tools - Next
Taking over here to track https://jira.suse.com/browse/ENGINFRA-1893
Updated by okurz 10 months ago
- Related to coordination #158374: [epic] Prevention of inefficient hardware resource use added