action #122983
[alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
0%
Description
Observation
openqa/monitor-o3 is failing because openqaworker1 is down:
PING openqaworker1.openqanet.opensuse.org (192.168.112.6) 56(84) bytes of data.
--- openqaworker1.openqanet.opensuse.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
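The failure above comes down to the packet-loss figure in the ping summary. A minimal sketch in Python of reading that figure, assuming the iputils summary format shown in the log; the helper name `packet_loss_percent` is made up for illustration and is not part of monitor-o3:

```python
import re


def packet_loss_percent(ping_output):
    """Return the packet loss percentage from a ping summary, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


# Summary lines as quoted in the observation above.
sample = (
    "--- openqaworker1.openqanet.opensuse.org ping statistics ---\n"
    "1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms\n"
)
print(packet_loss_percent(sample))
```

Anything above 0 for a single probe means the host did not answer, which is the condition that makes the monitoring job fail here.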
Acceptance criteria
- AC1: openqaworker1 is up and survives reboots
Rollback steps
- Disable s390x worker slots on rebel again (to use the setup on openqaworker1 again instead).
Suggestions
- Try to login
- Reboot via ipmi
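A rough sketch of what the IPMI reboot could look like when scripted, assuming `ipmitool` with the `lanplus` interface is used. The host name and user below are placeholders, not the real BMC details, and the helper `ipmi_power_command` is hypothetical. The `-E` option makes ipmitool read the password from the `IPMI_PASSWORD` environment variable, so it does not appear on the command line:

```python
import shlex

# Power actions accepted by "ipmitool chassis power".
POWER_ACTIONS = {"status", "on", "off", "cycle", "reset"}


def ipmi_power_command(host, user, action="status"):
    """Build the ipmitool argv list for a chassis power action."""
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", action]


# Placeholder host and user; print the command instead of running it.
print(shlex.join(ipmi_power_command("worker-ipmi.example", "ADMIN", "cycle")))
```

Checking `status` first is the safer order, since it confirms that IPMI access works before anything is power-cycled.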
Related issues
History
#4
Updated by okurz 2 months ago
w1 runs s390x instances so the impact is more than just x86_64. This was brought up in https://suse.slack.com/archives/C02CANHLANP/p1673523496304249
(Sofia Syrianidou) what's wrong with o3 s390x? I scheduled a couple of test in the morning and they are still not assigned to a worker.
#5
Updated by cdywan 2 months ago
- Blocked by action #123028: A/C broken in TAM lab size:M added
#8
Updated by mkittler 2 months ago
- Status changed from Workable to Blocked
The worker is currently explicitly offline, see blocker. IPMI access works at least (via reverted command).
#9
Updated by cdywan 2 months ago
I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.
#10
Updated by cdywan 2 months ago
cdywan wrote:
I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
#11
Updated by okurz 2 months ago
cdywan wrote:
I guess worker1 should be removed from salt?
No, o3 workers are not in salt. The workers are listed in https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml
Since it's still failing our deployment monitoring.
That's not a deployment monitoring but an explicit monitoring for o3 workers. Removed the openqaworker1 config for now with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
And also added the missing openqaworker19+20 in a subsequent commit.
#13
Updated by okurz 2 months ago
- Status changed from Blocked to Feedback
- Priority changed from Urgent to Normal
openqaworker1 monitoring was disabled with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
and we don't need that machine critically so we can reduce priority.
I created https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 to follow Marius's suggestion to use when: manual instead of disabled code. And then eventually when openqaworker1 is usable in FC labs, see #119548, we can try to connect the machine again with o3 over routing over different locations.
@Marius I suggest to set this ticket to "Blocked" by #119548 as soon as https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 is merged
#14
Updated by mkittler about 2 months ago
- Status changed from Feedback to Blocked
#15
Updated by mkittler 12 days ago
Looks like the worker is now in FC: https://racktables.suse.de/index.php?page=object&object_id=1260
I couldn't reach it via SSH (from ariel) or IPMI, though. So I guess this ticket is still blocked.
#17
Updated by okurz 12 days ago
mkittler wrote:
I set this ticket to feedback because I'm not sure what other ticket I'm waiting for.
Well, it was #119548 which is resolved so you can continue.
What we can do as next step is to do one of the following
- 1. Find the dynamic DHCP lease, e.g. from "ip n" on a neighboring machine -> DONE from qa-jump, no match
- 2. Or wait for https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 so that we would be able to find the DHCP lease from the DHCP server directly
- 3. Add the machine into the ops salt repo with both the ipmi+prod Ethernet and use it as experimental OSD worker from FC Basement
- 4. or skip step 3. and make it work as o3 worker,
- 4a. either coordinate with Eng-Infra how to connect it into the o3 network
- 4b. just connect it over the public https interface https://openqa.opensuse.org
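The neighbor-table lookup from the first option can be sketched as follows. This assumes the usual `ip n` line layout (`<ip> dev <iface> lladdr <mac> <state>`) and that the machine's MAC address is known, e.g. from racktables. The addresses in the sample are documentation placeholders, not real data, and `ips_for_mac` is a hypothetical helper:

```python
def ips_for_mac(ip_neigh_output, mac):
    """Return the IP addresses whose neighbor entry carries the given MAC."""
    wanted = mac.lower()
    matches = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Entries in FAILED/INCOMPLETE state have no "lladdr" field.
        if "lladdr" not in fields:
            continue
        idx = fields.index("lladdr")
        if idx + 1 < len(fields) and fields[idx + 1].lower() == wanted:
            matches.append(fields[0])
    return matches


# Made-up sample output using documentation address ranges.
sample = """\
192.0.2.10 dev eth0 lladdr 00:00:5e:00:53:01 REACHABLE
192.0.2.11 dev eth0 FAILED
192.0.2.12 dev eth0 lladdr 00:00:5e:00:53:02 STALE
"""
print(ips_for_mac(sample, "00:00:5E:00:53:02"))
```

An empty result corresponds to the "no match" outcome noted above. It only shows that the machine has not talked to that neighbor recently, not that it has no lease.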
#18
Updated by mkittler 11 days ago
I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275 for 3.
#20
Updated by okurz 5 days ago
mkittler wrote:
The MR is still pending.
When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.
Yes, of course it wasn't considered yet. That is why we do this exploration task here :) What about 4b? Just connect to https://openqa.opensuse.org?
#22
Updated by mkittler 3 days ago
The MR has been merged but I cannot resolve openqaworker1-ipmi.qe.nue2.suse.org or openqaworker1.qe.nue2.suse.org. I'm using VPN and I can resolve e.g. thincsus.qe.nue2.suse.org so it is likely not a local problem.
I'm also unable to establish an IPMI or SSH connection using the IPs. Maybe this needs on-site investigation?
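The name-resolution comparison described above can be repeated with a small script. A minimal sketch, assuming Python's standard `socket` module is available on the machine connected to the VPN; the `resolver` parameter is an addition for illustration so the lookup can be replaced in tests:

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted unique addresses for hostname, or [] if it does not resolve."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Host names as mentioned in the comment above.
    for name in ("openqaworker1-ipmi.qe.nue2.suse.org",
                 "openqaworker1.qe.nue2.suse.org",
                 "thincsus.qe.nue2.suse.org"):
        print(name, resolve(name) or "does not resolve")
```

If the reference host resolves while both openqaworker1 names do not, the missing DNS records are more likely than a local resolver problem.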