action #122983: [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Actions

Copy link

#1

Updated by livdywan over 2 years ago

Subject changed from [alert] openqaworker1 to [alert] openqa/monitor-o3 failing because openqaworker1 is down

Actions

Copy link

#2

Updated by okurz over 2 years ago

Tags set to infra
Priority changed from High to Urgent

As long as the monitoring pipeline is active it will bug about this, so this needs urgent handling, at best today

Actions

Copy link

#3

Updated by livdywan over 2 years ago

Subject changed from [alert] openqa/monitor-o3 failing because openqaworker1 is down to [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

#4

Updated by okurz over 2 years ago

w1 runs s390x instances so the impact is more than just x86_64. This was brought up in https://suse.slack.com/archives/C02CANHLANP/p1673523496304249

(Sofia Syrianidou) what's wrong with o3 s390x? I scheduled a couple of test in the morning and they are still not assigned to a worker.

Actions

Copy link

#5

Updated by livdywan over 2 years ago

Blocked by action #123028: A/C broken in TAM lab size:M added

Actions

Copy link

#6

Updated by mkittler over 2 years ago

Description updated (diff)

As part of #122998 I've been enabling the s390x worker slots on rebel instead.

Actions

Copy link

#7

Updated by mkittler over 2 years ago

Assignee set to mkittler

Actions

Copy link

#8

Updated by mkittler over 2 years ago

Status changed from Workable to Blocked

The worker is currently explicitly offline, see blocker. IPMI access works at least (via reverted command).

Actions

Copy link

#9

Updated by livdywan over 2 years ago

I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.

Actions

Copy link

#10

Updated by livdywan over 2 years ago

cdywan wrote:

I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.

https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83

Actions

Copy link

#11

Updated by okurz over 2 years ago

cdywan wrote:

I guess worker1 should be removed from salt?

No, o3 workers are not in salt. The workers are listed in https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml

Since it's still failing our deployment monitoring.

That'? not a deployment monitoring but an explicit monitoring for o3 workers. Removed the openqaworker1 config for now with

https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83

And also added the missing openqaworker19+20 in a subsequent commit.

Actions

Copy link

#12

Updated by livdywan over 2 years ago

Due date deleted (~~2023-01-20~~)

This is blocking on a blocked ticket. Thus resetting the due date.

Actions

Copy link

#13

Updated by okurz over 2 years ago

Status changed from Blocked to Feedback
Priority changed from Urgent to Normal

openqaworker1 monitoring was disabled with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
and we don't need that machine critically so we can reduce priority.
I created https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 to follow Marius's suggestion to use when: manual instead of disabled code. And then eventually when openqaworker1 is usable in FC labs, see #119548, we can try to connect the machine again with o3 over routing over different locations.

@Marius I suggest to set this ticket to "Blocked" by #119548 as soon as https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 is merged

Actions

Copy link

#14

Updated by mkittler over 2 years ago

Status changed from Feedback to Blocked

Actions

Copy link

#15

Updated by mkittler about 2 years ago

Looks like the worker is now in FC: https://racktables.suse.de/index.php?page=object&object_id=1260

I couldn't reach it via SSH (from ariel) or IPMI, though. So I guess this ticket is still blocked.

Actions

Copy link

#16

Updated by mkittler about 2 years ago

Status changed from Blocked to Feedback

I set this ticket to feedback because I'm not sure what other ticket I'm waiting for. Surely the AC problem in the TAM lab isn't relevant anymore and #119548 is resolved. So we need to talk about it in the unblock meeting.

Actions

Copy link

#17

Updated by okurz about 2 years ago

mkittler wrote:

I set this ticket to feedback because I'm not sure what other ticket I'm waiting for.

Well, it was #119548 which is resolved so you can continue.

What we can do as next step is to done of the following

Find the dynamic DHCP lease, e.g. from ip n on a neighboring machine -> DONE from qa-jump, no match
Or wait for https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 so that we would be able to find the DHCP lease from the DHCP server directly
Add the machine into the ops salt repo with both the ipmi+prod Ethernet and use it as experimental OSD worker from FC Basement
or skip step 3. and make it work as o3 worker,

4a. either coordinate with Eng-Infra how to connect it into the o3 network
4b. just connect it over the public https interface https://openqa.opensuse.org

Actions

Copy link

#18

Updated by mkittler about 2 years ago

I've been creating https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275 for 3.

Actions

Copy link

#19

Updated by mkittler about 2 years ago

The MR is still pending.

When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.

Actions

Copy link

#20

Updated by okurz about 2 years ago

mkittler wrote:

The MR is still pending.

When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.

Yes, of course it wasn't considered yet. That is why we do this exploration task here :) What about 4b? Just connect to https://openqa.opensuse.org?

Actions

Copy link

#21

Updated by mkittler about 2 years ago

Status changed from Feedback to Blocked

Actions

Copy link

#22

Updated by mkittler about 2 years ago

The MR has been merged but I cannot resolve openqaworker1-ipmi.qe.nue2.suse.org or openqaworker1.qe.nue2.suse.org. I'm using VPN and I can resolve e.g. thincsus.qe.nue2.suse.org so it is likely not a local problem.

I'm also unable to establish an IPMI or SSH connection using the IPs. Maybe this needs on-site investigation?

Actions

Copy link

#23

Updated by mkittler about 2 years ago

Status changed from Blocked to Feedback

Actions

Copy link

#24

Updated by mkittler about 2 years ago

Status changed from Feedback to In Progress

I can now resolve both domains and establish an IPMI connection. So whatever the problem was, it is now solved. The machine was powered off so I've just powered it on. Let's see whether I can simply connect it to o3 as like I would connect any public worker.

The system boots and has a link via:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 2c:60:0c:73:03:d6 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
    altname ens255f0
    inet 192.168.112.6/24 brd 192.168.112.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::2e60:cff:fe73:3d6/64 scope link 
       valid_lft forever preferred_lft forever

However, the IP doesn't match the one configured by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275 and there's no IP connectivity.

Since IPMI is at least working I've created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515.

Actions

Copy link

#25

Updated by openqa_review about 2 years ago

Due date set to 2023-04-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#26

Updated by okurz about 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515 merged.

/etc/sysconfig/network/ifcfg-eth0 shows

BOOTPROTO='static'
STARTMODE='auto'
IPADDR='192.168.112.6/24'
ZONE=trusted

so configure that to DHCP and try again

Actions

Copy link

#27

Updated by mkittler about 2 years ago

Status changed from In Progress to Resolved

I've also had to get rid of the static DNS server configured in /etc/sysconfig/network/config. With that networking looks good and it can connect to o3. There are also no failed systemd services. Everything survived a reboot so I guess AC1 is fullfilled.

So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.

Actions

Copy link

#28

Updated by okurz about 2 years ago

mkittler wrote:

So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.

No need for that but we should ensure our wiki describing the o3 infra covers openqaworker1 in the current state. And please check the racktables entry that it correctly describes the current use

Actions

Copy link

#29

Updated by okurz about 2 years ago

Status changed from Resolved to Feedback

Actions

Copy link

#30

Updated by favogt about 2 years ago

Status changed from Feedback to Workable

Apparently ow1 is alive again and attempted to run some jobs on o3.

However, they fail:

[2023-03-31T10:10:29.124064+02:00] [error] Unable to setup job 3202162: The source directory /var/lib/openqa/share/tests/opensuse does not exist

It appears like the IP also changed from 10.168.192.6 to 10.168.192.120. While the latter is pingable from o3, SSH does not work.

For the time being I just did systemctl disable --now openqa-worker-auto-restart@{1..20}.service as workaround.

Actions

Copy link

#31

Updated by okurz about 2 years ago

Priority changed from Normal to Urgent

Actions

Copy link

#32

Updated by mkittler about 2 years ago

Status changed from Workable to In Progress

I've enabled the services again but I setup a special worker class for testing. I suppose the main problem was simply that the test pool server hasn't been adapted yet.

Note that you can simply edit /etc/openqa/workers.ini changing the worker class. There's no need to deal with systemd services.

Actions

Copy link

#33

Updated by mkittler about 2 years ago

Status changed from In Progress to Feedback

I'm afraid this setup is not going to work because the rsync server on ariel is not exposed. So it is not possibly to sync tests. The same counts for NFS.

We could sync /var/lib/openqa/share/tests from OSD instead but it is likely a bad idea as the directory might contain internal files (e.g. SLE needles).

Since I keep the worker up but only with WORKER_CLASS=openqaworker1 so it doesn't do any harm.

Note that AC1 is nevertheless fulfilled so I'm inclined to resolve this ticket. Especially because I cannot do much about it anyways. I've also already attempted to connect with Infra as suggested in option 4a of #122983#note-17 but haven't got a useful response. I could create an SD ticket if that's wanted, though. Otherwise we could use the worker as an OSD worker.

Actions

Copy link

#34

Updated by mkittler about 2 years ago

Priority changed from Urgent to High

Actions

Copy link

#35

Updated by okurz about 2 years ago

mkittler wrote:

Note that AC1 is nevertheless fulfilled so I'm inclined to resolve this ticket.

That would bring the risk that ow1 might idle for years wasting power and nobody is making good use of the machine.

I think using the machine as part of OSD is also fine for the time being. Then at least it's put to good use

Actions

Copy link

#36

Updated by mkittler about 2 years ago

And another alternative that came up in the chat: Setup fetchneedles on ow1 as it is normally done on the web UI.

Note that in case we don't use the machine I would always power it off. So we'd at least not waste any power :-)

Actions

Copy link

#37

Updated by mkittler about 2 years ago

I've just setup fetchneedles in accordance with the o3 web UI host. It generally works. There are still problems:

The developer mode doesn't work and I don't think we can fix that. I suppose this is something we could live with, though.
The openQA-in-openQA test I've tried could not resolve codecs.opensuse.org from within the SUT: https://openqa.opensuse.org/tests/3207420#step/openqa_webui/9
- I'm not yet sure why that is. The domain is resolvable on ow1 in general and curl returns data.
- The problem persists after restarting.
- Another test also runs into errors on zypper in …: https://openqa.opensuse.org/tests/3207418#step/prepare/11

Maybe it is better to just use it as OSD worker for now.

Actions

Copy link

#38

Updated by mkittler about 2 years ago

Maybe the problems mentioned in my previous comment can be explained by #127256. I've nevertheless configured the worker now to connect to OSD to cross-check. (Of course still using just openqaworker1 as WORKER_CLASS.)

Actions

Copy link

#39

Updated by mkittler about 2 years ago

I've cloned an OSD job and it ran into a random DNS error as well: https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6

So is suspect this ticket is really related to #127256. I suppose that also means it is blocked by #127256 because without reliable DNS we cannot use the machine as worker.

Actions

Copy link

#40

Updated by mkittler about 2 years ago

Blocked by action #127256: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M added

Actions

Copy link

#41

Updated by okurz about 2 years ago

Related to action #126188: [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that that leads to tangible test run failure size:M added

Actions

Copy link

#42

Updated by livdywan about 2 years ago

Due date changed from 2023-04-13 to 2023-04-28

mkittler wrote:

I've cloned an OSD job and it ran into a random DNS error as well: https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6

So is suspect this ticket is really related to #127256. I suppose that also means it is blocked by #127256 because without reliable DNS we cannot use the machine as worker.

Presumably still blocking on #127256, hence bumping the due date.

Actions

Copy link

#43

Updated by okurz about 2 years ago

Due to progress within https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 we can now debug the DHCP server on walter1.qe.nue2.suse.org. mkittler and me did over an IPMI SoL on worker1 ifdown eth0 && ifup eth0 and got a complete entry in /etc/resolv.conf so that did not immediately reproduce the problem that /etc/resolv.conf would be incomplete.

It seems that both walter1+walter2 can serve DHCP requests using a failover but with synchronized entries so we should be fine to just look at one journal at a time.

There is an error showing up in the dhcpd journal "dns2.qe.nue2.suse.org: host unknown.". Apparantely that host does not exist in any references on walter1:/etc/ nor walter2:/etc/ except for the dhcpd configs trying to publish that nameserver.

We removed that entry for now on both walter1 and walter2

I ran

for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && ifdown eth0 && ifup eth0 ; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done

but couldn't reproduce any problems with nameserver config yet.

Maybe with restarting the complete network stack:

for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done

which did never return from the inner loop likely because ifstatus still shows "device-no-running" due to DHCPv6 never fulfilled. So changed to use ifstatus eth0 | grep -q not-running instead of just exit code evaluation.

This seems to work. Now let's try to break the loop as soon as nameserver entries are completely missing.

for i in {1..100000}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; grep -q nameserver /etc/resolv.conf || break; done

EDIT: Not reproduced after 333 runs. I guess we can't reproduce like this. I suggest to try with actual reboots.

Actions

Copy link

#44

Updated by livdywan about 2 years ago

Discussed in the Unblock. Please try and reproduce using openQA tests, and if that doesn't reproduce it consider it solved.

Actions

Copy link

#45

Updated by pcervinka about 2 years ago

Maybe you can check https://progress.opensuse.org/issues/127256#note-11 if it helps.

Actions

Copy link

#46

Updated by mkittler about 2 years ago

The issue is not resolved. Then I tried to run a job on openqaworker1 it was stuck in the state setup because the worker itself lacked the nameserver. So this does not only happen after a reboot but can also happen in the middle (as openqaworker1 has been running for 5 days and could initially connect to the web UI).

When running the loop from above (which effectively restart the network via systemctl restart network.service) this changes nothing. The nameserver is still missing in /etc/resolv.conf. In the DHCP logs it looks like this:

Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter1 dhcpd[29309]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0

Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0

Not sure whether the message about dns2.qe.nue2.suse.org shown in the middle is relevant.

After restarting wicket a 3rd time it worked again. Now the logs look different:

Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0

Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0

Between the 2nd and 3rd attempt the following was logged:

Apr 26 14:06:19 walter2 dhcpd[30886]: balancing pool 55ab0933cb60 10.168.192.0/22  total 201  free 88  backup 107  lts 9  max-own (+/-)20
Apr 26 14:06:19 walter2 dhcpd[30886]: balanced pool 55ab0933cb60 10.168.192.0/22  total 201  free 88  backup 107  lts 9  max-misbal 29
Apr 26 14:06:22 walter2 dhcpd[30886]: reuse_lease: lease age 5023 (secs) under 25% threshold, reply with unaltered, existing lease for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: No hostname for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.193.56 from 98:be:94:4b:8e:98 via eth0
Apr 26 14:06:22 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPACK on 10.168.193.56 to 98:be:94:4b:8e:98 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPDISCOVER from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPOFFER on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.93 (10.168.192.2) from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPACK on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 deleted host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 new dynamic host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 201 leases to leases file.

Actions

Copy link

#47

Updated by mkittler about 2 years ago

I've just tried with only one dhcp server (the one on walter1, stopped the one on walter2). The problem was still reproducible. However, after removing dns2.qe.nue2.suse.org from dhcpd.conf it seems ok. Maybe it makes sense to remove that entry.

Actions

Copy link

#48

Updated by mkittler about 2 years ago

If we're lucky everything boils down to fixing a typo: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456

Actions

Copy link

#49

Updated by okurz about 2 years ago

https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456 was merged and is deployed to both our DHCP servers walter1.qe.nue2.suse.org and walter2.qe.nue2.suse.org . We assume this fixes the problem.

Actions

Copy link

#50

Updated by mkittler about 2 years ago

I'm running a few more tests (have just restarted https://openqa.suse.de/tests/10992724).

So, if everything looks good - how should I proceed:

Add the worker as OSD worker. That would mean adding it to our salt infrastructure.
Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done it in #122983#note-37. The caveat of that approach:
- This setup might become out-of-sync with o3 and then needs to be deal with manually. While it is not a big deal this means the worker might be in a state where it produces incompletes until we take care of it.
- The mount /var/lib/openqa/share will not be available on that worker. We want to avoid relying on it anyways but nevertheless not having it makes this worker odd and prone to produce incompletes when tests rely on it after all.

I would tend to use it as an OSD worker.

Actions

Copy link

#51

Updated by mkittler about 2 years ago

For running some tests I keep the worker as OSD worker. I've cloned a few tests via sudo openqa-clone-job --skip-chained-deps --skip-download --within-instance https://openqa.suse.de/tests/… _GROUP=0 BUILD+=-ow1-test TEST+=-ow1-test WORKER_CLASS=openqaworker1:

passed: https://openqa.suse.de/tests/10992724
softfailure: https://openqa.suse.de/tests/10992727
strange failure¹ but not related to specific worker: https://openqa.suse.de/tests/10992730
passed: https://openqa.suse.de/tests/10992732

¹ Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" of relation "job_modules" violates not-null constraint DETAIL: Failing row contains (2864833368, 10992730, null, tests/btrfs-progs/generate_report… - Maybe a bug/race-condition in the code for uploading external results.

Actions

Copy link

#52

Updated by okurz about 2 years ago

mkittler wrote:

I'm running a few more tests (have just restarted https://openqa.suse.de/tests/10992724).

So, if everything looks good - how should I proceed:

Add the worker as OSD worker. That would mean adding it to our salt infrastructure.

Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done it in #122983#note-37. The caveat of that approach:

This setup might become out-of-sync with o3 and then needs to be deal with manually. While it is not a big deal this means the worker might be in a state where it produces incompletes until we take care of it.

The mount /var/lib/openqa/share will not be available on that worker. We want to avoid relying on it anyways but nevertheless not having it makes this worker odd and prone to produce incompletes when tests rely on it after all.

I would tend to use it as an OSD worker.

I would say yes. We could theoretically think about a feature to make full asset+tests syncing possible over https but then again we plan to load and likely "cache" tests from git so I guess for that we better wait for #58184

Actions

Copy link

#53

Updated by mkittler about 2 years ago

I've been adding the worker to salt and created a MR for its configuration: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528

Actions

Copy link

#54

Updated by nicksinger about 2 years ago

Note that this worker triggered an "failed systemd service" alert on 2023-04-28 at 14:15 - not sure if this was caused by you working on the machine or if something failed unexpectedly. This is what was shown in the journal:

Apr 28 13:58:01 openqaworker1 systemd[1]: Reloading openQA Worker #10...
Apr 28 13:58:01 openqaworker1 worker[26373]: [info] Received signal HUP
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Deactivated successfully.
Apr 28 13:58:01 openqaworker1 systemd[1]: Reloaded openQA Worker #10.
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 11.
Apr 28 13:58:02 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:02 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:03 openqaworker1 worker[24048]: [info] [pid:24048] worker 10:
Apr 28 13:58:03 openqaworker1 worker[24048]:  - config file:                      /etc/openqa/workers.ini
Apr 28 13:58:03 openqaworker1 worker[24048]:  - name used to register:            openqaworker1
Apr 28 13:58:03 openqaworker1 worker[24048]:  - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:03 openqaworker1 worker[24048]:  - isotovideo version:               38
Apr 28 13:58:03 openqaworker1 worker[24048]:  - websocket API version:            1
Apr 28 13:58:03 openqaworker1 worker[24048]:  - web UI hosts:                     localhost
Apr 28 13:58:03 openqaworker1 worker[24048]:  - class:                            ?
Apr 28 13:58:03 openqaworker1 worker[24048]:  - no cleanup:                       no
Apr 28 13:58:03 openqaworker1 worker[24048]:  - pool directory:                   /var/lib/openqa/pool/10
Apr 28 13:58:03 openqaworker1 worker[24048]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:03 openqaworker1 worker[24048]:  at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:03 openqaworker1 worker[24048]:         OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x55fdacf3cc60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:03 openqaworker1 worker[24048]:         OpenQA::Worker::init(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:03 openqaworker1 worker[24048]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 12.
Apr 28 13:58:03 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:03 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:04 openqaworker1 worker[24114]: [info] [pid:24114] worker 10:
Apr 28 13:58:04 openqaworker1 worker[24114]:  - config file:                      /etc/openqa/workers.ini
Apr 28 13:58:04 openqaworker1 worker[24114]:  - name used to register:            openqaworker1
Apr 28 13:58:04 openqaworker1 worker[24114]:  - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:04 openqaworker1 worker[24114]:  - isotovideo version:               38
Apr 28 13:58:04 openqaworker1 worker[24114]:  - websocket API version:            1
Apr 28 13:58:04 openqaworker1 worker[24114]:  - web UI hosts:                     localhost
Apr 28 13:58:04 openqaworker1 worker[24114]:  - class:                            ?
Apr 28 13:58:04 openqaworker1 worker[24114]:  - no cleanup:                       no
Apr 28 13:58:04 openqaworker1 worker[24114]:  - pool directory:                   /var/lib/openqa/pool/10
Apr 28 13:58:04 openqaworker1 worker[24114]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:04 openqaworker1 worker[24114]:  at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:04 openqaworker1 worker[24114]:         OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x563fe6781c60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:04 openqaworker1 worker[24114]:         OpenQA::Worker::init(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:04 openqaworker1 worker[24114]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopping openQA Worker #10...
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.024s CPU time.
Apr 28 15:56:12 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Unit cannot be reloaded because it is inactive.

Actions

Copy link

#55

Updated by nicksinger about 2 years ago

Found another issue with our deployment pipeline today. It complains about a missing folder to upgrade the openQA package:

    (1/5) Installing: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 [......
    error: unpacking of archive failed on file /var/lib/openqa/share/factory: cpio: chown failed - No such file or directory
    error: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64: install failed
    error: openQA-4.6.1682608278.68a0ff2-lp154.5738.1.x86_64: erase skipped
    error]
    Installation of openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 failed:
    Error: Subprocess failed. Error: RPM failed: Command exited with status 1.

Is there a problem with our package too which this worker just shows now? If so, feel free to split this off into another ticket.

Actions

Copy link

#56

Updated by okurz about 2 years ago

Is this really on worker1? We have observed the same problem on baremetal-supportserver. The problem only happens if there is 1. openQA webui installed, 2. NFS share from another webui server, 3. Mismatch in uids. On baremetal-supportserver we fixed this by syncing uids manually but here I suggest to remove the openQA package as only the worker package should be necessary

Actions

Copy link

#57

Updated by mkittler about 2 years ago

#122983#note-55 should be fixed by uninstalling the openQA web UI package. I had only installed it for fetchneeles to test it as o3 worker.

Actions

Copy link

#58

Updated by mkittler about 2 years ago

#122983#note-54 should be fixed by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528 - it is just that salt at this point treats the machine as generic host and wiped the worker config to the bare minimum. So host is an empty string and we don't have API credentials for that "host".

Actions

Copy link

#59

Updated by okurz about 2 years ago

Due date changed from 2023-04-28 to 2023-05-12

discussed in daily, bumped due-date accordingly

Actions

Copy link

#60

Updated by mkittler about 2 years ago

I've cloned the last 100 successful tests on OSD (excluding ones with parallel dependencies):

openqa=# \copy (select distinct jobs.id from jobs join job_settings on jobs.id = job_settings.job_id left join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id) where dependency != 2 and result = 'passed' and job_settings.key = 'WORKER_CLASS' and job_settings.value = 'qemu_x86_64' order by id desc limit 100) to '/tmp/jobs_to_clone_x86_64' csv;
COPY 100

martchus@openqa:~> for job_id in $(cat /tmp/jobs_to_clone_x86_64 ) ; do openqa-clone-job --host openqa.suse.de --apikey … --apisecret … --skip-download --skip-chained-deps --clone-children --parental-inheritance "https://openqa.suse.de/tests/$job_id" _GROUP=0 TEST+=-ow1-test BUILD=test-ow1 WORKER_CLASS=openqaworker1 ; done

Apparently some of the jobs have chained children so we've got actually more than 100 jobs. Link to overview: https://openqa.suse.de/tests/overview?build=test-ow1

Actions

Copy link

#61

Updated by mkittler about 2 years ago

I've just been reviewing the overview. The failures are due to:

Jobs requiring a tap setup have accidentally been cloned.
Jobs requiring a private asset have accidentally been cloned.
Some jobs fail also sometimes on other workers in the same way.

However, 120 jobs have passed/softfailed. So I guess that's good enough. I've been creating a MR to enable the worker for real: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532

Actions

Copy link

#62

Updated by okurz about 2 years ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532 merged, please monitor that real production jobs pass on this worker with the new worker class. Please make sure openqaworker1 shows up on https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 .

Actions

Copy link

#63

Updated by mkittler about 2 years ago

The fail+incomplete ratio is so far similar to other worker hosts:

openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%worker%' and t_finished >= '2023-05-01' group by host order by ratio_failed_by_host desc;
        host         | ratio_failed_by_host | total 
---------------------+----------------------+-------
 openqa-piworker     |                  100 |    12
 worker12            |                22.22 |    18
 worker11            |                20.43 |   186
 worker2             |                18.64 |  1937
 openqaworker-arm-3  |                16.72 |  1029
 openqaworker1       |                14.35 |   418
 worker10            |                13.96 |   523
 openqaworker14      |                 13.5 |   941
 openqaworker17      |                13.42 |  1207
 openqaworker-arm-2  |                13.03 |  1036
 worker13            |                13.02 |   791
 openqaworker18      |                11.91 |  1217
 openqaworker16      |                11.76 |  1156
 worker3             |                11.52 |  1137
 worker5             |                11.51 |  2346
 worker6             |                11.19 |  1779
 powerqaworker-qam-1 |                10.96 |   374
 worker9             |                10.96 |   785
 worker8             |                10.61 |   886
 openqaworker-arm-1  |                 9.76 |   502
(20 rows)

With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/852 merged the worker shows up on https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1. I can also create another MR for adding otherwise forgotten hosts. However, I wouldn't consider this part of the ticket.

Actions

Copy link

#64

Updated by mkittler about 2 years ago

Status changed from Feedback to Resolved

Opened a MR to update the dashboard: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/853

With that I'm resolving this ticket.

Project

General

Profile

QA (public) » openQA Project (public) » openQA Infrastructure (public)

Tags

Custom queries

action #122983

[alert] openqa/monitor-o3 failing because openqaworker1 is down size:M

Observation¶

Acceptance criteria¶

Rollback steps¶

Suggestions¶

Updated by livdywan over 2 years ago

Updated by okurz over 2 years ago

Updated by livdywan over 2 years ago

Updated by okurz over 2 years ago

Updated by livdywan over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler over 2 years ago

Updated by livdywan over 2 years ago

Updated by livdywan over 2 years ago

Updated by okurz over 2 years ago

Updated by livdywan over 2 years ago

Updated by okurz over 2 years ago

Updated by mkittler over 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by openqa_review about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by okurz about 2 years ago

Updated by favogt about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by livdywan about 2 years ago

Updated by okurz about 2 years ago

Updated by livdywan about 2 years ago

Updated by pcervinka about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by nicksinger about 2 years ago

Updated by nicksinger about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago

Updated by okurz about 2 years ago

Updated by mkittler about 2 years ago

Updated by mkittler about 2 years ago