action #160424
closed
o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete size:M
Added by nicksinger 7 months ago.
Updated 7 months ago.
Category:
Regressions/Crashes
Description
Motivation
In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. Affected workers seem to be arm21 and arm22 which also report as "localhost" to o3. In that time frame, I found the following logs:
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018)
May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure
referencing the job https://openqa.opensuse.org/tests/4193752
Restarting the openqa-worker instances registered them correctly against o3 and they seem to be capable of completing jobs again. I haven't tried a reboot yet.
Acceptance criteria
- AC1: o3 worker hosts consistently register with a proper hostname and not "localhost", without needing to hardcode the hostname in workers.ini
- AC2: It is still possible to register as "localhost", e.g. for single instance container setup, when using the loopback interface
Suggestions
- Check if there was a significant change in openQA which could have caused this
- Use reboots to validate proper registered hostnames for workers on o3
- Research in older tickets whether we already had such problems in the past and what we did to resolve them
- We should ensure in openQA worker registration code that a "proper hostname" is used for registration while still allowing for "localhost" to register, e.g. for a single instance container setup
- Remove hardcoded FQDNs from workers.ini again after the problem is fixed
- Ensure that the workers register with proper FQDN again over multiple reboots
- Research why the host is relevant here. Shouldn't this hypothetically be fine even with "localhost"?
- Should the worker register as "broken" when it has no proper name? Not just "localhost"
- Consider only allowing "localhost" when the connection is using the local loopback interface
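The last suggestion can be sketched as a small shell check. This is only an illustration of the rule "allow localhost only over loopback", not openQA's actual registration code; the function names `is_loopback_host` and `may_register` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of openQA.

# Succeeds if the given web UI host looks like a loopback address.
is_loopback_host() {
    case "$1" in
        localhost|127.*|::1|'[::1]') return 0 ;;
        *) return 1 ;;
    esac
}

# may_register NAME WEBUI_HOST
# Succeeds if a worker calling itself NAME may register with WEBUI_HOST:
# an empty name is always rejected, "localhost" is accepted only when the
# web UI is reached via loopback, any other name is accepted.
may_register() {
    name=$1
    webui_host=$2
    case "$name" in
        '') return 1 ;;
        localhost|localhost.localdomain)
            is_loopback_host "$webui_host" ;;
        *) return 0 ;;
    esac
}
```

For example, `may_register localhost openqa.opensuse.org` fails, while `may_register localhost 127.0.0.1` succeeds, which would keep the single instance container setup from AC2 working.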
Further details
- Category set to Regressions/Crashes
- Target version set to Ready
I found the same for aarch64-o3. It does not have a hostname set until DHCP completes:
Mai 16 03:36:27 localhost wickedd-dhcp4[1536]: eth0: Committed DHCPv4 lease with address 10.168.193.2 (lease time 64795, renew in 32395 sec, rebind in 56695 sec)
Mai 16 03:36:28 aarch64-o3 systemd[1]: Reloading System Logging Service...
This leads to
Mai 16 03:36:29 aarch64-o3 worker[9394]: [warn] Unable to determine worker address (WORKER_HOSTNAME) - checking again for web UI 'https://openqa.opensuse.org' in 201.60 s
...
Mai 16 03:34:59 localhost worker[9381]: - name used to register: localhost
Mai 16 03:34:59 localhost worker[9381]: - worker address (WORKER_HOSTNAME): localhost
and eventually
Mai 16 03:36:39 aarch64-o3 worker[9394]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/1278 finished by remote side with code 1008, only one connection per worker allowed,
I guess the openQA server gets confused if multiple servers register with the same WORKER_HOSTNAME and connections go to the wrong workers.
For aarch64-o3 I set WORKER_HOSTNAME explicitly in workers.ini.
- Description updated (diff)
@favogt thanks for handling the situation.
That rings a bell. Maybe we had such problems already in the past? I extended the ACs and suggestions accordingly.
For aarch64-o3 I set WORKER_HOSTNAME explicitly in workers.ini.
@favogt And did you now do the same for arm21 and arm22? I'm asking because it looks like you did on both workers (e.g. WORKER_HOSTNAME = openqaworker-arm22 is configured on arm22). If this was already configured before, then the recent problem must have had a different source, though.
EDIT: Looks like you haven't done anything yet because /etc/openqa/workers.ini was only modified on 8 Apr. This is actually in line with what I remember from looking at the worker properties. I can't tell for sure because the workers have already been cleaned up, but I think they were shown with Host: localhost but with e.g. WORKER_HOSTNAME=openqaworker-arm22.
Note that it is probably a bug that the worker connects despite not knowing its hostname - especially since it leads to the follow-up problem "only one connection per worker allowed".
- Subject changed from o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete to o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to dheidler
Other workers have the hostname set statically:
openqaworker25:~ # hostnamectl
Static hostname: openqaworker25
Icon name: computer-server
Chassis: server
Machine ID: 3b98eec0757247af86be1759edbed0c3
Boot ID: bf90f41be77d49d8bedc4f1ca616022c
Operating System: openSUSE Leap 15.5
CPE OS Name: cpe:/o:opensuse:leap:15.5
Kernel: Linux 5.14.21-150500.55.62-default
Architecture: x86-64
Hardware Vendor: Happyware
Hardware Model: AS-2014TP-HTR
So let's do that on arm21 and arm22 as well:
openqaworker-arm21:~ # hostnamectl
Static hostname: n/a
Transient hostname: openqaworker-arm21
Icon name: computer-server
Chassis: server
Machine ID: d92eb8deb4a64aa5a55b5d569c73056b
Boot ID: 16eda8bb6ec249c0ae7c0b6255fce649
Operating System: openSUSE Leap 15.5
CPE OS Name: cpe:/o:opensuse:leap:15.5
Kernel: Linux 5.14.21-150500.55.62-default
Architecture: arm64
Hardware Vendor: Giga Computing
Hardware Model: R272-P30-00
openqaworker-arm21:~ # hostnamectl set-hostname openqaworker-arm21
openqaworker-arm21:~ # hostnamectl
Static hostname: openqaworker-arm21
Icon name: computer-server
Chassis: server
Machine ID: d92eb8deb4a64aa5a55b5d569c73056b
Boot ID: 16eda8bb6ec249c0ae7c0b6255fce649
Operating System: openSUSE Leap 15.5
CPE OS Name: cpe:/o:opensuse:leap:15.5
Kernel: Linux 5.14.21-150500.55.62-default
Architecture: arm64
Hardware Vendor: Giga Computing
Hardware Model: R272-P30-00
openqaworker-arm22:~ # hostnamectl set-hostname openqaworker-arm22
aarch64-o3:~ # hostnamectl set-hostname aarch64-o3
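To validate this over reboots, the static hostname can be checked from the `hostnamectl` output. A minimal sketch, assuming the output format shown above where an unset static hostname is printed as "n/a" (other systemd versions may print a different placeholder); the function name `static_hostname_from` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads `hostnamectl` output on stdin and prints the
# static hostname. Returns 1 if the line is missing or the value is "n/a".
static_hostname_from() {
    # Strip the "Static hostname:" label and keep only the first match.
    name=$(sed -n 's/^[[:space:]]*Static hostname:[[:space:]]*//p' | head -n 1)
    case "$name" in
        ''|n/a) return 1 ;;
        *) printf '%s\n' "$name" ;;
    esac
}
```

On a worker this could be used as `hostnamectl | static_hostname_from || echo "no static hostname set"`.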
- Due date set to 2024-06-05
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
Rebooted a worker which looked fine.
I guess we can consider this resolved.
- Copied to action #161327: Prevent remote workers registering as "localhost" size:S added