action #122983
closed[alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
Description
Observation
openqa/monitor-o3 is failing because openqaworker1 is down:
PING openqaworker1.openqanet.opensuse.org (192.168.112.6) 56(84) bytes of data.
--- openqaworker1.openqanet.opensuse.org ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
Acceptance criteria
- AC1: openqaworker1 is up and survives reboots
Rollback steps
- Disable s390x worker slots on rebel again (to use the setup on openqaworker1 again instead).
Suggestions
- Try to log in
- Reboot via IPMI (see the sketch below)
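A minimal sketch of what the IPMI check and reboot could look like; host, user and password are placeholders, not values from this ticket:
# check basic reachability first
ping -c 3 openqaworker1.openqanet.opensuse.org
# query the power state, then power-cycle if the host is wedged (illustrative credentials)
ipmitool -I lanplus -H <ipmi-host> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <ipmi-host> -U <user> -P <password> chassis power cycle
# optionally attach to the serial console to watch the boot
ipmitool -I lanplus -H <ipmi-host> -U <user> -P <password> sol activate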
Updated by livdywan almost 2 years ago
- Subject changed from [alert] openqaworker1 to [alert] openqa/monitor-o3 failing because openqaworker1 is down
Updated by okurz almost 2 years ago
- Tags set to infra
- Priority changed from High to Urgent
As long as the monitoring pipeline is active it will keep alerting about this, so this needs urgent handling, ideally today.
Updated by livdywan almost 2 years ago
- Subject changed from [alert] openqa/monitor-o3 failing because openqaworker1 is down to [alert] openqa/monitor-o3 failing because openqaworker1 is down size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 2 years ago
w1 runs s390x instances so the impact is more than just x86_64. This was brought up in https://suse.slack.com/archives/C02CANHLANP/p1673523496304249
(Sofia Syrianidou) what's wrong with o3 s390x? I scheduled a couple of tests in the morning and they are still not assigned to a worker.
Updated by livdywan almost 2 years ago
- Blocked by action #123028: A/C broken in TAM lab size:M added
Updated by mkittler almost 2 years ago
- Status changed from Workable to Blocked
The worker is currently explicitly offline, see blocker. IPMI access works at least (via reverted command).
Updated by livdywan almost 2 years ago
I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.
Updated by livdywan almost 2 years ago
cdywan wrote:
I guess worker1 should be removed from salt? Since it's still failing our deployment monitoring.
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
Updated by okurz almost 2 years ago
cdywan wrote:
I guess worker1 should be removed from salt?
No, o3 workers are not in salt. The workers are listed in https://gitlab.suse.de/openqa/monitor-o3/-/blob/master/.gitlab-ci.yml
Since it's still failing our deployment monitoring.
That's not deployment monitoring but explicit monitoring of o3 workers. I removed the openqaworker1 config for now with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
And also added the missing openqaworker19+20 in a subsequent commit.
Updated by livdywan almost 2 years ago
- Due date deleted (2023-01-20)
This is blocking on a blocked ticket. Thus resetting the due date.
Updated by okurz almost 2 years ago
- Status changed from Blocked to Feedback
- Priority changed from Urgent to Normal
openqaworker1 monitoring was disabled with
https://gitlab.suse.de/openqa/monitor-o3/-/commit/121f84cd71de4b8c9e226cec34f0f5bc287d4f83
and we don't need that machine critically so we can reduce priority.
I created https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 to follow Marius's suggestion to use when: manual instead of disabled code (see the sketch below). Eventually, when openqaworker1 is usable in the FC labs (see #119548), we can try to connect the machine to o3 again via routing across locations.
@Marius I suggest setting this ticket to "Blocked" by #119548 as soon as https://gitlab.suse.de/openqa/monitor-o3/-/merge_requests/9 is merged.
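For illustration only, a per-worker job in the monitor-o3 .gitlab-ci.yml could be kept but made non-blocking roughly like this; the job name and script are assumptions, the actual change is in the merge request above:
# hypothetical excerpt: "when: manual" keeps the job defined but only runs it when
# triggered by hand, instead of commenting the job out entirely
ping-openqaworker1:
  stage: ping
  script:
    - ping -c 30 openqaworker1.openqanet.opensuse.org
  when: manual
  allow_failure: true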
Updated by mkittler almost 2 years ago
Looks like the worker is now in FC: https://racktables.suse.de/index.php?page=object&object_id=1260
I couldn't reach it via SSH (from ariel) or IPMI, though. So I guess this ticket is still blocked.
Updated by mkittler almost 2 years ago
- Status changed from Blocked to Feedback
I set this ticket to Feedback because I'm not sure what other ticket I'm waiting for. Surely the A/C problem in the TAM lab isn't relevant anymore and #119548 is resolved. So we need to talk about it in the unblock meeting.
Updated by okurz almost 2 years ago
mkittler wrote:
I set this ticket to feedback because I'm not sure what other ticket I'm waiting for.
Well, it was #119548 which is resolved so you can continue.
What we can do as a next step is one of the following:
1. Find the dynamic DHCP lease, e.g. from ip n on a neighboring machine -> DONE from qa-jump, no match
2. Or wait for https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 so that we can look up the DHCP lease on the DHCP server directly (see the sketch below)
3. Add the machine to the ops salt repo with both the IPMI and production Ethernet interfaces and use it as an experimental OSD worker from the FC Basement
4. Or skip step 3 and make it work as an o3 worker:
   - 4a. either coordinate with Eng-Infra how to connect it into the o3 network
   - 4b. or just connect it over the public https interface https://openqa.opensuse.org
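A rough sketch of how that lease lookup could be done, using the MAC address later seen on the machine's eth0 (2c:60:0c:73:03:d6); the lease database path is an assumption and may differ:
# on a neighboring machine: look for the MAC in the neighbor/ARP table
ip neigh | grep -i '2c:60:0c:73:03:d6'
# on the DHCP server (once accessible): search the leases database for that MAC
grep -i -B 8 '2c:60:0c:73:03:d6' /var/lib/dhcp/db/dhcpd.leases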
Updated by mkittler almost 2 years ago
I've created https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275 for option 3.
Updated by mkittler over 1 year ago
The MR is still pending.
When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.
Updated by okurz over 1 year ago
mkittler wrote:
The MR is still pending.
When asking about 4. on Slack I've only got feedback from Matthias stating that this use case wasn't considered. So perhaps it isn't easy to implement now.
Yes, of course it wasn't considered yet. That is why we do this exploration task here :) What about 4b? Just connect to https://openqa.opensuse.org?
Updated by mkittler over 1 year ago
The MR has been merged but I cannot resolve openqaworker1-ipmi.qe.nue2.suse.org or openqaworker1.qe.nue2.suse.org. I'm using VPN and I can resolve e.g. thincsus.qe.nue2.suse.org so it is likely not a local problem.
I'm also unable to establish an IPMI or SSH connection using the IPs. Maybe this needs on-site investigation?
Updated by mkittler over 1 year ago
- Status changed from Feedback to In Progress
I can now resolve both domains and establish an IPMI connection. So whatever the problem was, it is now solved. The machine was powered off so I've just powered it on. Let's see whether I can simply connect it to o3 like I would connect any public worker.
The system boots and has a link via:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 2c:60:0c:73:03:d6 brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
altname ens255f0
inet 192.168.112.6/24 brd 192.168.112.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2e60:cff:fe73:3d6/64 scope link
valid_lft forever preferred_lft forever
However, the IP doesn't match the one configured by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3275 and there's no IP connectivity.
Since IPMI is at least working I've created https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515.
Updated by openqa_review over 1 year ago
- Due date set to 2023-04-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/515 merged.
/etc/sysconfig/network/ifcfg-eth0 shows
BOOTPROTO='static'
STARTMODE='auto'
IPADDR='192.168.112.6/24'
ZONE=trusted
so configure that to use DHCP and try again.
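A minimal sketch of that change, assuming the usual SUSE sysconfig/wicked syntax:
# /etc/sysconfig/network/ifcfg-eth0 -- drop the stale static o3 address, use DHCP instead
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=trusted
followed by ifdown eth0 && ifup eth0 to re-apply the configuration.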
Updated by mkittler over 1 year ago
- Status changed from In Progress to Resolved
I also had to get rid of the static DNS server configured in /etc/sysconfig/network/config. With that, networking looks good and it can connect to o3. There are also no failed systemd services. Everything survived a reboot so I guess AC1 is fulfilled.
So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.
Updated by okurz over 1 year ago
mkittler wrote:
So I'm considering this ticket resolved for now. Let me know if I should still look into some of the other options.
No need for that, but we should ensure our wiki describing the o3 infra covers openqaworker1 in its current state. And please check that the racktables entry correctly describes the current use.
Updated by favogt over 1 year ago
- Status changed from Feedback to Workable
Apparently ow1 is alive again and attempted to run some jobs on o3.
However, they fail:
[2023-03-31T10:10:29.124064+02:00] [error] Unable to setup job 3202162: The source directory /var/lib/openqa/share/tests/opensuse does not exist
It appears the IP also changed from 10.168.192.6 to 10.168.192.120. While the latter is pingable from o3, SSH does not work.
For the time being I just ran systemctl disable --now openqa-worker-auto-restart@{1..20}.service as a workaround.
Updated by mkittler over 1 year ago
- Status changed from Workable to In Progress
I've enabled the services again but I set up a special worker class for testing. I suppose the main problem was simply that the test pool server hasn't been adapted yet.
Note that you can simply edit /etc/openqa/workers.ini to change the worker class. There's no need to deal with systemd services.
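For illustration, the relevant part of /etc/openqa/workers.ini could look like this; the worker class is the one used in this ticket, the rest is a generic example:
[global]
# temporary machine-specific class so only explicitly cloned jobs get scheduled here
WORKER_CLASS = openqaworker1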
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
I'm afraid this setup is not going to work because the rsync server on ariel is not exposed, so it is not possible to sync tests. The same goes for NFS.
We could sync /var/lib/openqa/share/tests from OSD instead, but that is likely a bad idea as the directory might contain internal files (e.g. SLE needles).
For now I keep the worker up but only with WORKER_CLASS=openqaworker1 so it doesn't do any harm.
Note that AC1 is nevertheless fulfilled, so I'm inclined to resolve this ticket, especially because I cannot do much about it anyway. I've also already attempted to connect with Infra as suggested in option 4a of #122983#note-17 but haven't got a useful response. I could create an SD ticket if that's wanted, though. Otherwise we could use the worker as an OSD worker.
Updated by okurz over 1 year ago
mkittler wrote:
Note that AC1 is nevertheless fulfilled so I'm inclined to resolve this ticket.
That would bring the risk that ow1 idles for years, wasting power, with nobody making good use of the machine.
I think using the machine as part of OSD is also fine for the time being. Then at least it's put to good use.
Updated by mkittler over 1 year ago
And another alternative that came up in the chat: set up fetchneedles on ow1 as is normally done on the web UI host (see the sketch below).
Note that in case we don't use the machine I would always power it off. So we'd at least not waste any power :-)
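A rough sketch of what that could involve; the fetchneedles script ships with openQA, but the user name and the idea of running it periodically are assumptions here:
# one-off sync of the test and needle repositories into /var/lib/openqa/share/tests,
# run as the openQA service user (user name assumed)
sudo -u geekotest /usr/share/openqa/script/fetchneedles
# repeat periodically (cron or a systemd timer) so the checkouts stay in sync with git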
Updated by mkittler over 1 year ago
I've just set up fetchneedles in accordance with the o3 web UI host. It generally works. There are still problems:
- The developer mode doesn't work and I don't think we can fix that. I suppose this is something we could live with, though.
- The openQA-in-openQA test I've tried could not resolve codecs.opensuse.org from within the SUT: https://openqa.opensuse.org/tests/3207420#step/openqa_webui/9
  - I'm not yet sure why that is. The domain is resolvable on ow1 in general and curl returns data.
  - The problem persists after restarting.
- Another test also runs into errors on zypper in …: https://openqa.opensuse.org/tests/3207418#step/prepare/11
Maybe it is better to just use it as an OSD worker for now.
Updated by mkittler over 1 year ago
Maybe the problems mentioned in my previous comment can be explained by #127256. I've nevertheless configured the worker now to connect to OSD to cross-check. (Of course still using just openqaworker1 as WORKER_CLASS.)
Updated by mkittler over 1 year ago
I've cloned an OSD job and it ran into a random DNS error as well: https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6
So I suspect this ticket is really related to #127256. I suppose that also means it is blocked by #127256 because without reliable DNS we cannot use the machine as a worker.
Updated by mkittler over 1 year ago
- Blocked by action #127256: missing nameservers in dhcp response for baremetal machines in NUE-FC-B 2 size:M added
Updated by okurz over 1 year ago
- Related to action #126188: [openQA][infra][worker][sut] openQA infra performance fluctuates to the level that that leads to tangible test run failure size:M added
Updated by livdywan over 1 year ago
- Due date changed from 2023-04-13 to 2023-04-28
mkittler wrote:
I've cloned an OSD job and it ran into a random DNS error as well: https://openqa.suse.de/tests/10863595#step/nautilus_open_ftp/6
So I suspect this ticket is really related to #127256. I suppose that also means it is blocked by #127256 because without reliable DNS we cannot use the machine as a worker.
Presumably still blocking on #127256, hence bumping the due date.
Updated by okurz over 1 year ago
Due to progress within https://sd.suse.com/servicedesk/customer/portal/1/SD-113959 we can now debug the DHCP server on walter1.qe.nue2.suse.org. mkittler and I ran ifdown eth0 && ifup eth0 over an IPMI SoL on worker1
and got a complete entry in /etc/resolv.conf, so that did not immediately reproduce the problem of /etc/resolv.conf being incomplete.
It seems that both walter1+walter2 can serve DHCP requests using a failover but with synchronized entries so we should be fine to just look at one journal at a time.
There is an error showing up in the dhcpd journal: "dns2.qe.nue2.suse.org: host unknown.". Apparently that host does not exist in any references on walter1:/etc/ nor walter2:/etc/ except for the dhcpd configs trying to publish that nameserver.
We removed that entry for now on both walter1 and walter2.
I ran
for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && ifdown eth0 && ifup eth0 ; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done
but couldn't reproduce any problems with nameserver config yet.
Maybe with restarting the complete network stack:
for i in {1..30}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; done
which never returned from the inner loop, likely because ifstatus still shows "device-not-running" due to DHCPv6 never being fulfilled. So I changed to ifstatus eth0 | grep -q not-running
instead of just evaluating the exit code.
This seems to work. Now let's try to break the loop as soon as nameserver entries are completely missing.
for i in {1..100000}; do echo "### Run: $i -- $(date -Is)" && systemctl restart network.service ; until ifstatus eth0 | grep -q not-running; do echo -n "." && sleep 1; done; ifstatus eth0; tail -n 5 /etc/resolv.conf ; ip a show dev eth0; ls -l /etc/resolv.conf; grep -q nameserver /etc/resolv.conf || break; done
EDIT: Not reproduced after 333 runs. I guess we can't reproduce it like this. I suggest trying with actual reboots, e.g. as sketched below.
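A possible way to test that, looping over real reboots from a jump host; hostname, SSH access and the timeout are assumptions:
for i in {1..50}; do
  echo "### Reboot $i -- $(date -Is)"
  ssh root@openqaworker1.qe.nue2.suse.org reboot || true
  sleep 300   # give the machine time to boot and acquire a DHCP lease
  ssh root@openqaworker1.qe.nue2.suse.org 'grep nameserver /etc/resolv.conf' \
    || { echo "nameserver missing after reboot $i"; break; }
done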
Updated by livdywan over 1 year ago
Discussed in the Unblock. Please try and reproduce using openQA tests, and if that doesn't reproduce it consider it solved.
Updated by pcervinka over 1 year ago
Maybe you can check https://progress.opensuse.org/issues/127256#note-11 if it helps.
Updated by mkittler over 1 year ago
The issue is not resolved. When I tried to run a job on openqaworker1 it was stuck in the setup state because the worker itself lacked the nameserver. So this does not only happen after a reboot but can also happen in the middle of operation (openqaworker1 had been running for 5 days and could initially connect to the web UI).
When running the loop from above (which effectively restarts the network via systemctl restart network.service) this changes nothing; the nameserver is still missing in /etc/resolv.conf. In the DHCP logs it looks like this:
Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter1 dhcpd[29309]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 13:59:34 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 13:59:34 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
Not sure whether the message about dns2.qe.nue2.suse.org shown in the middle is relevant.
After restarting wicked a 3rd time it worked again. Now the logs look different:
Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter1 dhcpd[29309]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.120 from 2c:60:0c:73:03:d6 via eth0
Apr 26 14:12:56 walter2 dhcpd[30886]: DHCPACK on 10.168.192.120 to 2c:60:0c:73:03:d6 via eth0
Between the 2nd and 3rd attempt the following was logged:
Apr 26 14:06:19 walter2 dhcpd[30886]: balancing pool 55ab0933cb60 10.168.192.0/22 total 201 free 88 backup 107 lts 9 max-own (+/-)20
Apr 26 14:06:19 walter2 dhcpd[30886]: balanced pool 55ab0933cb60 10.168.192.0/22 total 201 free 88 backup 107 lts 9 max-misbal 29
Apr 26 14:06:22 walter2 dhcpd[30886]: reuse_lease: lease age 5023 (secs) under 25% threshold, reply with unaltered, existing lease for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: No hostname for 10.168.193.56
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.193.56 from 98:be:94:4b:8e:98 via eth0
Apr 26 14:06:22 walter2 dhcpd[30886]: dns2.qe.nue2.suse.org: host unknown.
Apr 26 14:06:22 walter2 dhcpd[30886]: DHCPACK on 10.168.193.56 to 98:be:94:4b:8e:98 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPDISCOVER from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:00 walter2 dhcpd[30886]: DHCPOFFER on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPREQUEST for 10.168.192.93 (10.168.192.2) from 00:0a:f7:de:79:54 via eth0
Apr 26 14:07:04 walter2 dhcpd[30886]: DHCPACK on 10.168.192.93 to 00:0a:f7:de:79:54 via eth0
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 deleted host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 0 new dynamic host decls to leases file.
Apr 26 14:10:51 walter2 dhcpd[30886]: Wrote 201 leases to leases file.
Updated by mkittler over 1 year ago
I've just tried with only one DHCP server (the one on walter1; I stopped the one on walter2). The problem was still reproducible. However, after removing dns2.qe.nue2.suse.org from dhcpd.conf it seems ok. Maybe it makes sense to remove that entry.
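For illustration, the kind of dhcpd.conf entry involved probably looks roughly like this; the exact names live in the OPS-Service salt repo, so treat this excerpt as a guess:
# dhcpd resolves hostnames in options when composing replies; an entry pointing at a
# name that does not resolve produced the "dns2.qe.nue2.suse.org: host unknown." log line
option domain-name-servers dns1.qe.nue2.suse.org, dns2.qe.nue2.suse.org;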
Updated by mkittler over 1 year ago
If we're lucky everything boils down to fixing a typo: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456
Updated by okurz over 1 year ago
https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3456 was merged and is deployed to both our DHCP servers walter1.qe.nue2.suse.org and walter2.qe.nue2.suse.org . We assume this fixes the problem.
Updated by mkittler over 1 year ago
I'm running a few more tests (have just restarted https://openqa.suse.de/tests/10992724).
So, if everything looks good - how should I proceed:
- Add the worker as OSD worker. That would mean adding it to our salt infrastructure.
- Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done it in #122983#note-37. The caveats of that approach:
  - This setup might become out-of-sync with o3 and would then need to be dealt with manually. While that is not a big deal, it means the worker might be in a state where it produces incompletes until we take care of it.
  - The mount /var/lib/openqa/share will not be available on that worker. We want to avoid relying on it anyway, but nevertheless not having it makes this worker odd and prone to producing incompletes when tests rely on it after all.
I would tend to use it as an OSD worker.
Updated by mkittler over 1 year ago
For running some tests I keep the worker as an OSD worker. I've cloned a few tests via sudo openqa-clone-job --skip-chained-deps --skip-download --within-instance https://openqa.suse.de/tests/… _GROUP=0 BUILD+=-ow1-test TEST+=-ow1-test WORKER_CLASS=openqaworker1:
- passed: https://openqa.suse.de/tests/10992724
- softfailure: https://openqa.suse.de/tests/10992727
- strange failure¹ but not related to specific worker: https://openqa.suse.de/tests/10992730
- passed: https://openqa.suse.de/tests/10992732
¹ Reason: api failure: 400 response: OpenQA::Schema::Result::Jobs::insert_module(): DBI Exception: DBD::Pg::st execute failed: ERROR: null value in column "name" of relation "job_modules" violates not-null constraint DETAIL: Failing row contains (2864833368, 10992730, null, tests/btrfs-progs/generate_report…
- Maybe a bug/race-condition in the code for uploading external results.
Updated by okurz over 1 year ago
mkittler wrote:
I'm running a few more tests (have just restarted https://openqa.suse.de/tests/10992724).
So, if everything looks good - how should I proceed:
- Add the worker as OSD worker. That would mean adding it to our salt infrastructure.
- Add the worker as o3 worker. That would mean setting up fetchneedles in accordance with o3. I have already done it in #122983#note-37. The caveats of that approach:
  - This setup might become out-of-sync with o3 and would then need to be dealt with manually. While that is not a big deal, it means the worker might be in a state where it produces incompletes until we take care of it.
  - The mount /var/lib/openqa/share will not be available on that worker. We want to avoid relying on it anyway, but nevertheless not having it makes this worker odd and prone to producing incompletes when tests rely on it after all.
I would tend to use it as an OSD worker.
I would say yes. We could theoretically think about a feature to make full asset+tests syncing possible over https, but then again we plan to load and likely "cache" tests from git, so I guess for that we better wait for #58184.
Updated by mkittler over 1 year ago
I've added the worker to salt and created an MR for its configuration: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528
Updated by nicksinger over 1 year ago
Note that this worker triggered a "failed systemd service" alert on 2023-04-28 at 14:15 - not sure if this was caused by you working on the machine or if something failed unexpectedly. This is what was shown in the journal:
Apr 28 13:58:01 openqaworker1 systemd[1]: Reloading openQA Worker #10...
Apr 28 13:58:01 openqaworker1 worker[26373]: [info] Received signal HUP
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Deactivated successfully.
Apr 28 13:58:01 openqaworker1 systemd[1]: Reloaded openQA Worker #10.
Apr 28 13:58:01 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 11.
Apr 28 13:58:02 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:02 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 4.581s CPU time.
Apr 28 13:58:02 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:02 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:03 openqaworker1 worker[24048]: [info] [pid:24048] worker 10:
Apr 28 13:58:03 openqaworker1 worker[24048]: - config file: /etc/openqa/workers.ini
Apr 28 13:58:03 openqaworker1 worker[24048]: - name used to register: openqaworker1
Apr 28 13:58:03 openqaworker1 worker[24048]: - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: - isotovideo version: 38
Apr 28 13:58:03 openqaworker1 worker[24048]: - websocket API version: 1
Apr 28 13:58:03 openqaworker1 worker[24048]: - web UI hosts: localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: - class: ?
Apr 28 13:58:03 openqaworker1 worker[24048]: - no cleanup: no
Apr 28 13:58:03 openqaworker1 worker[24048]: - pool directory: /var/lib/openqa/pool/10
Apr 28 13:58:03 openqaworker1 worker[24048]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:03 openqaworker1 worker[24048]: at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x55fdacf3cc60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::init(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:03 openqaworker1 worker[24048]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55fdb04186a8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Scheduled restart job, restart counter is at 12.
Apr 28 13:58:03 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:03 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.122s CPU time.
Apr 28 13:58:03 openqaworker1 systemd[1]: Starting openQA Worker #10...
Apr 28 13:58:03 openqaworker1 systemd[1]: Started openQA Worker #10.
Apr 28 13:58:04 openqaworker1 worker[24114]: [info] [pid:24114] worker 10:
Apr 28 13:58:04 openqaworker1 worker[24114]: - config file: /etc/openqa/workers.ini
Apr 28 13:58:04 openqaworker1 worker[24114]: - name used to register: openqaworker1
Apr 28 13:58:04 openqaworker1 worker[24114]: - worker address (WORKER_HOSTNAME): localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: - isotovideo version: 38
Apr 28 13:58:04 openqaworker1 worker[24114]: - websocket API version: 1
Apr 28 13:58:04 openqaworker1 worker[24114]: - web UI hosts: localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: - class: ?
Apr 28 13:58:04 openqaworker1 worker[24114]: - no cleanup: no
Apr 28 13:58:04 openqaworker1 worker[24114]: - pool directory: /var/lib/openqa/pool/10
Apr 28 13:58:04 openqaworker1 worker[24114]: API key and secret are needed for the worker connecting localhost
Apr 28 13:58:04 openqaworker1 worker[24114]: at /usr/share/openqa/script/../lib/OpenQA/Worker/WebUIConnection.pm line 50.
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::WebUIConnection::new("OpenQA::Worker::WebUIConnection", "localhost", HASH(0x563fe6781c60)) called at /usr/share/openqa/script/../l>
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::init(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 363
Apr 28 13:58:04 openqaworker1 worker[24114]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x563fe9c5d2f8)) called at /usr/share/openqa/script/worker line 125
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopping openQA Worker #10...
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Main process exited, code=exited, status=255/EXCEPTION
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Failed with result 'exit-code'.
Apr 28 13:58:04 openqaworker1 systemd[1]: Stopped openQA Worker #10.
Apr 28 13:58:04 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Consumed 1.024s CPU time.
Apr 28 15:56:12 openqaworker1 systemd[1]: openqa-worker-auto-restart@10.service: Unit cannot be reloaded because it is inactive.
Updated by nicksinger over 1 year ago
Found another issue with our deployment pipeline today. It complains about a missing folder when upgrading the openQA package:
(1/5) Installing: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 [......
error: unpacking of archive failed on file /var/lib/openqa/share/factory: cpio: chown failed - No such file or directory
error: openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64: install failed
error: openQA-4.6.1682608278.68a0ff2-lp154.5738.1.x86_64: erase skipped
error]
Installation of openQA-4.6.1682696190.26b7581-lp154.5745.1.x86_64 failed:
Error: Subprocess failed. Error: RPM failed: Command exited with status 1.
Is there a problem with our package too that just shows up on this worker now? If so, feel free to split this off into another ticket.
Updated by okurz over 1 year ago
Is this really on worker1? We have observed the same problem on baremetal-supportserver. The problem only happens if there is 1. the openQA web UI installed, 2. an NFS share from another web UI server, and 3. a mismatch in uids. On baremetal-supportserver we fixed this by syncing uids manually, but here I suggest removing the openQA package as only the worker package should be necessary.
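A minimal sketch of that suggestion, assuming the openQA web UI package can be removed without also removing the worker packages:
# remove only the web UI package; openQA-worker (and openQA-common) stay installed
sudo zypper rm openQA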
Updated by mkittler over 1 year ago
#122983#note-55 should be fixed by uninstalling the openQA web UI package. I had only installed it for fetchneedles to test the machine as an o3 worker.
Updated by mkittler over 1 year ago
#122983#note-54 should be fixed by https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/528 - it is just that salt at this point treated the machine as a generic host and wiped the worker config to the bare minimum. So the host is an empty string and we don't have API credentials for that "host".
Updated by okurz over 1 year ago
- Due date changed from 2023-04-28 to 2023-05-12
discussed in daily, bumped due-date accordingly
Updated by mkittler over 1 year ago
I've cloned the last 100 successful tests on OSD (excluding ones with parallel dependencies):
openqa=# \copy (select distinct jobs.id from jobs join job_settings on jobs.id = job_settings.job_id left join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id) where dependency != 2 and result = 'passed' and job_settings.key = 'WORKER_CLASS' and job_settings.value = 'qemu_x86_64' order by id desc limit 100) to '/tmp/jobs_to_clone_x86_64' csv;
COPY 100
martchus@openqa:~> for job_id in $(cat /tmp/jobs_to_clone_x86_64 ) ; do openqa-clone-job --host openqa.suse.de --apikey … --apisecret … --skip-download --skip-chained-deps --clone-children --parental-inheritance "https://openqa.suse.de/tests/$job_id" _GROUP=0 TEST+=-ow1-test BUILD=test-ow1 WORKER_CLASS=openqaworker1 ; done
Apparently some of the jobs have chained children so we've actually got more than 100 jobs. Link to overview: https://openqa.suse.de/tests/overview?build=test-ow1
Updated by mkittler over 1 year ago
I've just been reviewing the overview. The failures are due to:
- Jobs requiring a tap setup have accidentally been cloned.
- Jobs requiring a private asset have accidentally been cloned.
- Some jobs fail also sometimes on other workers in the same way.
However, 120 jobs have passed/softfailed. So I guess that's good enough. I've created an MR to enable the worker for real: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532
Updated by okurz over 1 year ago
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/532 merged, please monitor that real production jobs pass on this worker with the new worker class. Please make sure openqaworker1 shows up on https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1 .
Updated by mkittler over 1 year ago
The fail+incomplete ratio is so far similar to other worker hosts:
openqa=# with finished as (select result, t_finished, host from jobs left join workers on jobs.assigned_worker_id = workers.id where result != 'none') select host, round(count(*) filter (where result='failed' or result='incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_failed_by_host, count(*) total from finished where host like '%worker%' and t_finished >= '2023-05-01' group by host order by ratio_failed_by_host desc;
host | ratio_failed_by_host | total
---------------------+----------------------+-------
openqa-piworker | 100 | 12
worker12 | 22.22 | 18
worker11 | 20.43 | 186
worker2 | 18.64 | 1937
openqaworker-arm-3 | 16.72 | 1029
openqaworker1 | 14.35 | 418
worker10 | 13.96 | 523
openqaworker14 | 13.5 | 941
openqaworker17 | 13.42 | 1207
openqaworker-arm-2 | 13.03 | 1036
worker13 | 13.02 | 791
openqaworker18 | 11.91 | 1217
openqaworker16 | 11.76 | 1156
worker3 | 11.52 | 1137
worker5 | 11.51 | 2346
worker6 | 11.19 | 1779
powerqaworker-qam-1 | 10.96 | 374
worker9 | 10.96 | 785
worker8 | 10.61 | 886
openqaworker-arm-1 | 9.76 | 502
(20 rows)
With https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/852 merged the worker shows up on https://monitor.qa.suse.de/d/4KkGdvvZk/osd-status-overview?orgId=1. I can also create another MR for adding otherwise forgotten hosts. However, I wouldn't consider this part of the ticket.
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
Opened a MR to update the dashboard: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/853
With that I'm resolving this ticket.