action #160424
Updated by livdywan 10 months ago
## Motivation

In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. The affected workers seem to be arm21 and arm22, which also report as "localhost" to o3. In that time frame, I found the following logs:

```
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018)
May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure
```

referencing the job https://openqa.opensuse.org/tests/4193752

Restarting the openqa-worker instances registered them correctly against o3, and they seem to be capable of completing jobs again. I haven't tried a reboot yet.
## Acceptance criteria

* **AC1:** o3 worker hosts consistently register with a proper hostname and not `localhost`, without needing to hardcode the hostname in workers.ini
* **AC2:** It is still possible to register as "localhost", e.g. for a single-instance container setup, when using the loopback interface

## Suggestions

* Check if there was a significant change in openQA which could have caused this
* Use reboots to validate that workers on o3 register with proper hostnames
* Research in older tickets whether we already had such problems in the past and what we did to resolve them
* We should ensure in the openQA worker registration code that a "proper hostname" is used for registration while still allowing "localhost" to register, e.g. for a single-instance container setup
* Remove hardcoded FQDNs from workers.ini again after the problem is fixed
* Ensure that the workers register with a proper FQDN again over multiple reboots
* Research why the host is relevant here. Shouldn't this hypothetically be fine even with "localhost"?
* Should the worker register as "broken" when it has no proper name, not just when the name is "localhost"?
* Consider only allowing "localhost" when the connection is using the local loopback interface

## Further details

* `WORKER_HOSTNAME` was always correct (see #160424#note-4). This is about the "host" entry, e.g. on top of the info box in https://openqa.opensuse.org/admin/workers/1132
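
## Sketch: accepting "localhost" only over loopback

To make the suggestion about the loopback interface (and AC2) more concrete, here is a minimal sketch of the check, written in Python for illustration only. openQA itself is written in Perl, and nothing here is taken from its code: the function names `is_loopback` and `accept_registration`, the `PLACEHOLDER_HOSTS` set, and the idea that the web UI sees the worker's remote address at registration time are all assumptions.

```python
import ipaddress

# Hypothetical set of names that do not identify a worker host uniquely.
PLACEHOLDER_HOSTS = {"localhost", "localhost.localdomain"}


def is_loopback(remote_addr: str) -> bool:
    """Return True if remote_addr parses as an IPv4 or IPv6 loopback address."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not an IP address at all, so it cannot be trusted as loopback.
        return False
    # Treat IPv4-mapped IPv6 addresses (::ffff:127.0.0.1) like their IPv4 form.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_loopback


def accept_registration(host: str, remote_addr: str) -> bool:
    """Decide whether a worker may register under the given host name.

    Assumption: placeholder names such as "localhost" are acceptable only
    when the connection itself arrives over the loopback interface.
    """
    name = host.strip().lower().rstrip(".")
    if not name:
        return False
    if name in PLACEHOLDER_HOSTS:
        return is_loopback(remote_addr)
    return True
```

With this rule, a remote worker such as arm21 reporting "localhost" would be rejected (or could be marked as broken) instead of registering under an ambiguous name, while a single-instance container setup talking to the web UI over `127.0.0.1` or `::1` would keep working.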