action #160424

closed

o3: arm21 & arm22 show up as "localhost", (almost) all jobs incomplete size:M

Added by nicksinger 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-05-16
Due date: 2024-06-05
% Done: 0%
Estimated time:
Description

Motivation

In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. The affected workers seem to be arm21 and arm22, both of which also report as "localhost" to o3. For that time frame, I found the following logs on arm21:

May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018)
May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure

referencing the job https://openqa.opensuse.org/tests/4193752

Restarting the openqa-worker-instances registered the instances correctly against o3 and they seem to be capable of completing jobs again. I haven't tried a reboot yet.

Acceptance criteria

  • AC1: o3 worker hosts consistently register with a proper hostname and not localhost without needing to hardcode the hostname in workers.ini
  • AC2: It is still possible to register as "localhost", e.g. for single instance container setup, when using the loopback interface
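To make "proper hostname" in AC1 a bit more concrete, here is a minimal Python sketch of a worker-side check that could run before registration. It is not openQA code (openQA is written in Perl); the function names and the set of placeholder names are assumptions made purely for illustration.

```python
import socket

# Names treated as "not a proper hostname". This set is an illustrative
# assumption and is not taken from the openQA code base.
PLACEHOLDER_NAMES = {"", "localhost", "localhost.localdomain"}


def is_placeholder_hostname(name: str) -> bool:
    """Return True if `name` is empty or a localhost-style placeholder."""
    # Normalise whitespace, a trailing dot and letter case before comparing.
    return name.strip().rstrip(".").lower() in PLACEHOLDER_NAMES


def current_hostname() -> str:
    """Return the hostname as the operating system reports it right now."""
    return socket.gethostname()


if __name__ == "__main__":
    name = current_hostname()
    if is_placeholder_hostname(name):
        print(f"hostname {name!r} looks like a placeholder")
    else:
        print(f"hostname {name!r} looks usable for registration")
```

Such a check only detects the situation; whether the worker should then wait, retry, or report itself as broken is the open question discussed in this ticket.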

Suggestions

  • Check if there was a significant change in openQA which could have caused this
  • Use reboots to validate proper registered hostnames for workers on o3
  • Research older tickets to see whether we already had such problems in the past and how we resolved them
  • We should ensure in openQA worker registration code that a "proper hostname" is used for registration while still allowing for "localhost" to register, e.g. for a single instance container setup
  • Remove hardcoded FQDNs from workers.ini again after the problem is fixed
  • Ensure that the workers register with proper FQDN again over multiple reboots
  • Research why the host is relevant here. Shouldn't this hypothetically be fine even with "localhost"?
  • Should the worker register as "broken" whenever it has no proper name, not just when the name is "localhost"?
  • Consider only allowing "localhost" when the connection is using the local loopback interface
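The last suggestion, accepting "localhost" only over the loopback interface, could look roughly like the following Python sketch. This is not the openQA implementation (openQA is written in Perl); `may_register` and `is_loopback_address` are hypothetical names, and the sketch assumes the web UI knows the remote address of the connecting worker.

```python
import ipaddress


def is_loopback_address(remote_addr: str) -> bool:
    """Return True if `remote_addr` is an IPv4 or IPv6 loopback address."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP address at all, so never treat it as loopback.
        return False
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_loopback


def may_register(hostname: str, remote_addr: str) -> bool:
    """Hypothetical policy: "localhost" is only accepted via loopback."""
    if hostname.strip().lower() != "localhost":
        # Any other name is accepted by this sketch without further checks.
        return True
    return is_loopback_address(remote_addr)


if __name__ == "__main__":
    print(may_register("localhost", "127.0.0.1"))
    print(may_register("localhost", "10.0.0.5"))
    print(may_register("openqaworker-arm21", "10.0.0.5"))
```

Under this policy a single instance container setup talking to the web UI via 127.0.0.1 or ::1 can still register as "localhost" (AC2), while a remote host such as arm21 reporting "localhost" would be rejected. The sketch does not cover the separate question of other improper names.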

Further details


Related issues 1 (0 open, 1 closed)

Copied to openQA Project - action #161327: Prevent remote workers registering as "localhost" size:S (Resolved, okurz)
