action #122983

[alert] openqa/monitor-o3 failing because openqaworker1 is down size:M

Added by cdywan 2 months ago. Updated 3 days ago.

Status: Feedback
Priority: Normal
Assignee:
Target version:
Start date: 2023-01-02
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

openqa/monitor-o3 is failing because openqaworker1 is down:

PING openqaworker1.openqanet.opensuse.org (192.168.112.6) 56(84) bytes of data.
--- openqaworker1.openqanet.opensuse.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
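
For reference, a minimal sketch of how such a ping summary could be evaluated in a script. It assumes the iputils `ping` summary format shown above; the names `parse_ping_summary` and `host_is_down` are made up for illustration and are not part of openqa/monitor-o3.

```python
import re

# Matches the summary line printed by iputils ping, e.g.
# "1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms".
# The "+N errors" part is optional because ping omits it when there are none.
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received"
    r"(?:, \+(?P<errors>\d+) errors)?"
    r", (?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output: str) -> dict:
    """Extract the counters from ping output; raise ValueError if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return {
        "sent": int(match["sent"]),
        "received": int(match["received"]),
        "errors": int(match["errors"] or 0),
        "loss_percent": float(match["loss"]),
    }


def host_is_down(output: str) -> bool:
    """Treat the host as down when not a single reply was received."""
    return parse_ping_summary(output)["received"] == 0


sample = """PING openqaworker1.openqanet.opensuse.org (192.168.112.6) 56(84) bytes of data.
--- openqaworker1.openqanet.opensuse.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
"""
print(host_is_down(sample))
```

Checking `received == 0` rather than the loss percentage keeps a host with partial packet loss from being reported as down.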

Acceptance criteria

  • AC1: openqaworker1 is up and survives reboots

Rollback steps

  • Disable s390x worker slots on rebel again (to use the setup on openqaworker1 again instead).

Suggestions

  • Try to login
  • Reboot via ipmi

Related issues

Blocked by openQA Infrastructure - action #123028: A/C broken in TAM lab size:M | Resolved | 2023-01-12

History

#1 Updated by cdywan 2 months ago

  • Subject changed from [alert] openqaworker1 to [alert] openqa/monitor-o3 failing because openqaworker1 is down

#2 Updated by okurz 2 months ago

  • Tags set to infra
  • Priority changed from High to Urgent

As long as the monitoring pipeline is active it will keep complaining about this, so this needs urgent handling, ideally today

#3 Updated by cdywan 2 months ago

  • Subject changed from [alert] openqa/monitor-o3 failing because openqaworker1 is down to [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by okurz 2 months ago

w1 runs s390x instances so the impact is more than just x86_64. This was brought up in https://suse.slack.com/archives/C02CANHLANP/p1673523496304249

(Sofia Syrianidou) what's wrong with o3 s390x? I scheduled a couple of tests in the morning and they are still not assigned to a worker.

#5 Updated by cdywan 2 months ago

#6 Updated by mkittler 2 months ago

  • Description updated (diff)

As part of #122998 I've been enabling the s390x worker slots on rebel instead.

#7 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#8 Updated by mkittler 2 months ago

  • Status changed from Workable to Blocked

The worker is currently explicitly offline, see blocker. IPMI access works at least (via reverted command).

#9 Updated by cdywan 2 months ago

I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.

#10 Updated by cdywan 2 months ago

#11 Updated by okurz 2 months ago

cdywan wrote:

I guess worker1 should be removed from salt?

No, o3 workers are not in salt. The workers are listed in https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml

Since it's still failing our deployment monitoring.

That's not deployment monitoring but explicit monitoring for o3 workers. I removed the openqaworker1 config for now with

https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83

And also added the missing openqaworker19+20 in a subsequent commit.

#12 Updated by cdywan 2 months ago

  • Due date deleted (2023-01-20)

This is blocking on a blocked ticket. Thus resetting the due date.

#13 Updated by okurz 2 months ago

  • Status changed from Blocked to Feedback
  • Priority changed from Urgent to Normal

openqaworker1 monitoring was disabled with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
and we don't need that machine critically so we can reduce priority.
I created https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 to follow Marius's suggestion to use when: manual instead of disabled code. Eventually, when openqaworker1 is usable in FC labs, see #119548, we can try to connect the machine to o3 again by routing across locations.

@Marius I suggest to set this ticket to "Blocked" by #119548 as soon as https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 is merged

#14 Updated by mkittler about 2 months ago

  • Status changed from Feedback to Blocked

#15 Updated by mkittler 12 days ago

Looks like the worker is now in FC: https://racktables.suse.de/index.php?page=object&object_id=1260

I couldn't reach it via SSH (from ariel) or IPMI, though. So I guess this ticket is still blocked.

#16 Updated by mkittler 12 days ago

  • Status changed from Blocked to Feedback

I set this ticket to feedback because I'm not sure what other ticket I'm waiting for. Surely the AC problem in the TAM lab isn't relevant anymore and #119548 is resolved. So we need to talk about it in the unblock meeting.

#17 Updated by okurz 12 days ago

mkittler wrote:

I set this ticket to feedback because I'm not sure what other ticket I'm waiting for.

Well, it was #119548 which is resolved so you can continue.

What we can do as a next step is one of the following

  1. Find the dynamic DHCP lease, e.g. from ip n on a neighboring machine -> DONE from qa-jump, no match
  2. Or wait for https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 so that we would be able to find the DHCP lease from the DHCP server directly
  3. Add the machine into the ops salt repo with both the ipmi+prod Ethernet and use it as experimental OSD worker from FC Basement
  4. or skip step 3. and make it work as o3 worker,
    • 4a. either coordinate with Eng-Infra how to connect it into the o3 network
    • 4b. just connect it over the public https interface https://openqa.opensuse.org
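
Step 1 could be scripted roughly as follows. This is only a sketch: it assumes the usual `ip n` line layout (`<address> dev <iface> lladdr <mac> <state>`), and the MAC address and IPs in the sample are placeholders, not the real values for openqaworker1.

```python
def find_ips_by_mac(ip_neigh_output: str, mac: str) -> list[str]:
    """Return all neighbor addresses whose link-layer address equals `mac`."""
    wanted = mac.lower()
    matches = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Entries in state FAILED or INCOMPLETE carry no "lladdr" field.
        if "lladdr" not in fields:
            continue
        idx = fields.index("lladdr")
        if idx + 1 < len(fields) and fields[idx + 1].lower() == wanted:
            matches.append(fields[0])  # first field is the neighbor address
    return matches


# Placeholder data in the shape of `ip n` output (documentation IP range).
sample = """192.0.2.1 dev eth0 lladdr 52:54:00:aa:bb:01 REACHABLE
192.0.2.23 dev eth0 lladdr 52:54:00:AA:BB:02 STALE
192.0.2.42 dev eth0  FAILED
"""
print(find_ips_by_mac(sample, "52:54:00:aa:bb:02"))
```

On a machine in the same network segment the input could come from `subprocess.run(["ip", "n"], capture_output=True, text=True, check=True).stdout`. An empty result only means the neighbor table has no entry for that MAC, which is consistent with the "no match" seen from qa-jump.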

#19 Updated by mkittler 5 days ago

The MR is still pending.

When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.

#20 Updated by okurz 5 days ago

mkittler wrote:

The MR is still pending.

When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.

Yes, of course it wasn't considered yet. That is why we do this exploration task here :) What about 4b? Just connect to https://openqa.opensuse.org?

#21 Updated by mkittler 5 days ago

  • Status changed from Feedback to Blocked

#22 Updated by mkittler 3 days ago

The MR has been merged but I cannot resolve openqaworker1-ipmi.qe.nue2.suse.org or openqaworker1.qe.nue2.suse.org. I'm using VPN and I can resolve e.g. thincsus.qe.nue2.suse.org so it is likely not a local problem.

I'm also unable to establish an IPMI or SSH connection using the IPs. Maybe this needs on-site investigation?
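
A small sketch of the resolution check described above, in case someone wants to repeat it from another network location. The helper name `resolve_hosts` is made up for illustration; the resolver is passed in as a parameter so the logic can be tried without VPN access.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each host name to its IPv4 address, or None if it does not resolve."""
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


HOSTS = [
    "openqaworker1-ipmi.qe.nue2.suse.org",
    "openqaworker1.qe.nue2.suse.org",
    "thincsus.qe.nue2.suse.org",  # known-good name used as a control
]

# Example call (performs real DNS lookups, so it needs the VPN):
# for host, address in resolve_hosts(HOSTS).items():
#     print(host, address or "does not resolve")
```

Including a name that is known to resolve helps tell a missing DNS record apart from a local resolver or VPN problem.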

#23 Updated by mkittler 3 days ago

  • Status changed from Blocked to Feedback
