action #160424

closed

o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete size:M

Added by nicksinger 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-05-16
Due date:
2024-06-05
% Done:

0%

Estimated time:

Description

Motivation

In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. Affected workers seem to be arm21 and arm22 which also report as "localhost" to o3. In that time frame, I found the following logs:

May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018)
May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure

referencing the job https://openqa.opensuse.org/tests/4193752

Restarting the openqa-worker instances registered them correctly against o3 and they seem to be capable of completing jobs again. I haven't tried a reboot yet.

Acceptance criteria

  • AC1: o3 worker hosts consistently register with a proper hostname and not localhost without needing to hardcode the hostname in workers.ini
  • AC2: It is still possible to register as "localhost", e.g. for single instance container setup, when using the loopback interface

Suggestions

  • Check if there was a significant change in openQA which could have caused this
  • Use reboots to validate proper registered hostnames for workers on o3
  • Research older tickets to see whether we already had such problems in the past and what we did to resolve them
  • We should ensure in openQA worker registration code that a "proper hostname" is used for registration while still allowing for "localhost" to register, e.g. for a single instance container setup
  • Remove hardcoded FQDNs from workers.ini again after the problem is fixed
  • Ensure that the workers register with proper FQDN again over multiple reboots
  • Research why the host is relevant here. Shouldn't this hypothetically be fine even with "localhost"?
  • Should the worker register as "broken" whenever it has no proper name, not only when the name is "localhost"?
  • Consider only allowing "localhost" when the connection is using the local loopback interface
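
The last suggestion could look roughly like the sketch below. This is only an illustration in Python, not openQA code: the function names `is_loopback` and `may_register` and the set `LOCAL_NAMES` are hypothetical, and it assumes the web UI knows the peer IP address of the registering worker.

```python
import ipaddress

# Names that only make sense when worker and web UI share a host (assumption).
LOCAL_NAMES = {"localhost", "localhost.localdomain"}


def is_loopback(peer_address: str) -> bool:
    """Return True if peer_address is a loopback IP such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(peer_address).is_loopback
    except ValueError:
        # Not a valid IP address at all, so certainly not loopback.
        return False


def may_register(hostname: str, peer_address: str) -> bool:
    """Accept local-only names over loopback only; reject empty names."""
    name = hostname.strip().lower()
    if not name:
        return False
    if name in LOCAL_NAMES:
        return is_loopback(peer_address)
    return True


print(may_register("localhost", "127.0.0.1"))              # single-instance setup
print(may_register("localhost", "10.168.193.2"))           # remote worker without a name
print(may_register("openqaworker-arm21", "10.168.193.2"))  # properly named remote worker
```

With such a check, AC2 would still hold for a single instance container setup, while a remote worker that has not yet learned its hostname would be refused instead of colliding with other "localhost" workers.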

Further details


Related issues 1 (0 open, 1 closed)

Copied to openQA Project - action #161327: Prevent remote workers registering as "localhost" size:S (Status: Resolved, Assignee: okurz)

Actions #1

Updated by okurz 2 months ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #2

Updated by favogt 2 months ago

I found the same for aarch64-o3. It does not have a hostname set until DHCP completes:

Mai 16 03:36:27 localhost wickedd-dhcp4[1536]: eth0: Committed DHCPv4 lease with address 10.168.193.2 (lease time 64795, renew in 32395 sec, rebind in 56695 sec)
Mai 16 03:36:28 aarch64-o3 systemd[1]: Reloading System Logging Service...

This leads to

Mai 16 03:36:29 aarch64-o3 worker[9394]: [warn] Unable to determine worker address (WORKER_HOSTNAME) - checking again for web UI 'https://openqa.opensuse.org' in 201.60 s
...
Mai 16 03:34:59 localhost worker[9381]:  - name used to register:            localhost
Mai 16 03:34:59 localhost worker[9381]:  - worker address (WORKER_HOSTNAME): localhost

and eventually

Mai 16 03:36:39 aarch64-o3 worker[9394]: [warn] Websocket connection to https://openqa.opensuse.org/api/v1/ws/1278 finished by remote side with code 1008, only one connection per worker allowed,

I guess the openQA server gets confused if multiple servers register with the same WORKER_HOSTNAME and connections go to the wrong workers.

For aarch64-o3 I set WORKER_HOSTNAME explicitly in workers.ini.
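
As an alternative to hardcoding the name, the race described above (worker starting before DHCP has assigned the hostname) could be avoided by waiting for a usable hostname before registering. A minimal sketch in Python, assuming the system hostname is the right source of truth; `wait_for_hostname` and `PLACEHOLDER_NAMES` are made-up names and this is not how the openQA worker is implemented:

```python
import socket
import time
from typing import Callable, Optional

# Names treated as "hostname not known yet" (assumption for this sketch).
PLACEHOLDER_NAMES = {"", "localhost", "localhost.localdomain"}


def wait_for_hostname(
    get_hostname: Callable[[], str] = socket.gethostname,
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the hostname is no longer a placeholder; None if it never changes."""
    for attempt in range(attempts):
        name = get_hostname().strip()
        if name.lower() not in PLACEHOLDER_NAMES:
            return name
        if attempt < attempts - 1:
            # Give DHCP some time to commit the lease and set the hostname.
            sleep(delay_s)
    return None


if __name__ == "__main__":
    name = wait_for_hostname(attempts=3, delay_s=1.0)
    print(name if name is not None else "no proper hostname yet, not registering")
```

The `get_hostname` and `sleep` parameters exist only so the function can be tried out without waiting for a real DHCP lease.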

Actions #3

Updated by okurz 2 months ago

  • Description updated (diff)

@favogt thanks for handling the situation.

That rings a bell. Maybe we had such problems already in the past? I extended the ACs and suggestions accordingly.

Actions #4

Updated by mkittler 2 months ago · Edited

> For aarch64-o3 I set WORKER_HOSTNAME explicitly in workers.ini.

@favogt And did you now do the same for arm21 and arm22? I'm asking because it looks like you did on both workers (e.g. WORKER_HOSTNAME = openqaworker-arm22 is configured on arm22). If this was already configured before, then the recent problem must have had a different source, though.

EDIT: Looks like you haven't done anything yet because /etc/openqa/workers.ini was last modified on 8 Apr. This is actually in line with what I remember from looking at the worker properties. I can't tell for sure because the workers have already been cleaned up, but I think they were shown with Host: localhost but with e.g. WORKER_HOSTNAME=openqaworker-arm22.


Note that it is probably a bug that the worker connects despite not knowing its hostname - especially since it leads to the follow-up problem "only one connection per worker allowed".

Actions #5

Updated by livdywan about 2 months ago

  • Subject changed from o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete to o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by dheidler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #7

Updated by dheidler about 2 months ago · Edited

Other workers have the hostname set statically:

openqaworker25:~ # hostnamectl
 Static hostname: openqaworker25
       Icon name: computer-server
         Chassis: server
      Machine ID: 3b98eec0757247af86be1759edbed0c3
         Boot ID: bf90f41be77d49d8bedc4f1ca616022c
Operating System: openSUSE Leap 15.5
     CPE OS Name: cpe:/o:opensuse:leap:15.5
          Kernel: Linux 5.14.21-150500.55.62-default
    Architecture: x86-64
 Hardware Vendor: Happyware
  Hardware Model: AS-2014TP-HTR

So let's do that on arm21 and arm22 as well:

openqaworker-arm21:~ # hostnamectl
   Static hostname: n/a
Transient hostname: openqaworker-arm21
         Icon name: computer-server
           Chassis: server
        Machine ID: d92eb8deb4a64aa5a55b5d569c73056b
           Boot ID: 16eda8bb6ec249c0ae7c0b6255fce649
  Operating System: openSUSE Leap 15.5
       CPE OS Name: cpe:/o:opensuse:leap:15.5
            Kernel: Linux 5.14.21-150500.55.62-default
      Architecture: arm64
   Hardware Vendor: Giga Computing
    Hardware Model: R272-P30-00
openqaworker-arm21:~ # hostnamectl set-hostname openqaworker-arm21
openqaworker-arm21:~ # hostnamectl
 Static hostname: openqaworker-arm21
       Icon name: computer-server
         Chassis: server
      Machine ID: d92eb8deb4a64aa5a55b5d569c73056b
         Boot ID: 16eda8bb6ec249c0ae7c0b6255fce649
Operating System: openSUSE Leap 15.5
     CPE OS Name: cpe:/o:opensuse:leap:15.5
          Kernel: Linux 5.14.21-150500.55.62-default
    Architecture: arm64
 Hardware Vendor: Giga Computing
  Hardware Model: R272-P30-00
openqaworker-arm22:~ # hostnamectl set-hostname openqaworker-arm22
aarch64-o3:~ # hostnamectl set-hostname aarch64-o3
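
To check other hosts for the same condition, the hostnamectl output can be inspected for a missing static hostname. A small Python sketch; `static_hostname` is a hypothetical helper, and it assumes the "n/a" placeholder seen in the output above (other systemd versions may print something different):

```python
from typing import Optional


def static_hostname(hostnamectl_output: str) -> Optional[str]:
    """Return the static hostname from `hostnamectl` output, or None if unset."""
    for line in hostnamectl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Static hostname":
            value = value.strip()
            # "n/a" is the placeholder shown in this ticket (assumption for other versions).
            return None if value in ("", "n/a") else value
    return None


# Shortened excerpts of the arm21 output before and after `hostnamectl set-hostname`.
before = """\
   Static hostname: n/a
Transient hostname: openqaworker-arm21
"""
after = """\
 Static hostname: openqaworker-arm21
       Icon name: computer-server
"""

print(static_hostname(before))  # None
print(static_hostname(after))   # openqaworker-arm21
```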
Actions #8

Updated by openqa_review about 2 months ago

  • Due date set to 2024-06-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by dheidler about 2 months ago

  • Status changed from In Progress to Resolved

Rebooted a worker which looked fine.
I guess we can consider this resolved.

Actions #10

Updated by okurz about 2 months ago

  • Copied to action #161327: Prevent remote workers registering as "localhost" size:S added