action #160424
Updated by okurz 7 months ago
## Motivation
In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. Affected workers seem to be arm21 and arm22 which also report as "localhost" to o3. In that time frame, I found the following logs:
```
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018)
May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.
May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49
May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure
```
Restarting the openqa-worker instances made them register correctly against o3 again, and they seem to be capable of completing jobs again. I haven't tried a reboot yet.
## Acceptance criteria
* **AC1:** o3 worker hosts consistently register with a proper hostname and not `localhost` without needing to hardcode the hostname in workers.ini
## Suggestions
* Check if there was a significant change in openQA which could have caused this
* Use reboots to validate proper registered hostnames for workers on o3
* Research in older tickets whether we already had such problems in the past and what we did to resolve them
* We should ensure in openQA worker registration code that a "proper hostname" is used for registration while still allowing for "localhost" to register, e.g. for a single instance container setup
* Remove hardcoded FQDNs from workers.ini again after the problem is fixed
* Ensure that the workers register with proper FQDN again over multiple reboots
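
The suggestion about preferring a "proper hostname" at registration while still permitting "localhost" could look roughly like the sketch below. This is only an illustration in Python and not the actual openQA worker code (which is written in Perl); the names `is_proper_hostname`, `pick_registration_hostname`, `system_candidates` and the `allow_localhost` switch are all hypothetical.

```python
import socket

# Names that only identify the loopback interface (assumed list, extend as needed).
LOOPBACK_NAMES = {"localhost", "localhost4", "localhost6"}


def is_proper_hostname(name: str) -> bool:
    """Return True if name is non-empty and is not a loopback alias."""
    normalized = name.strip().rstrip(".").lower()
    if not normalized:
        return False
    if normalized in LOOPBACK_NAMES:
        return False
    # Also rejects e.g. "localhost.localdomain".
    return not normalized.startswith("localhost.")


def pick_registration_hostname(candidates, allow_localhost=False):
    """Pick the first proper hostname from candidates.

    Falls back to "localhost" only if the caller explicitly allows it,
    e.g. for a single instance container setup (hypothetical switch).
    """
    for candidate in candidates:
        if is_proper_hostname(candidate):
            return candidate.strip().rstrip(".")
    if allow_localhost:
        return "localhost"
    raise RuntimeError("no proper hostname available, refusing to register as localhost")


def system_candidates():
    """Hostname candidates from the running system, FQDN first."""
    return [socket.getfqdn(), socket.gethostname()]


if __name__ == "__main__":
    # Permissive here so the demo always prints something.
    print(pick_registration_hostname(system_candidates(), allow_localhost=True))
```

With a check like this, a worker that starts before its hostname is set up would fail loudly (or could retry) instead of silently registering as "localhost". Whether failing or retrying is the better behaviour is left open here.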