action #160424

Updated by okurz 2 months ago

## Motivation 
 In https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$jz0FLvQkzgOBQ1r0WC6ARuAFR0hRBHCO_fYSxDFH2gY @Guillaume_G made us aware of many incomplete jobs. Affected workers seem to be arm21 and arm22 which also report as "localhost" to o3. In that time frame, I found the following logs: 

 ``` 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] isotovideo has been started (PID: 75018) 
 May 16 08:06:49 openqaworker-arm21 worker[75018]: [info] 75018: WORKING 4193752 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0) 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead. 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [error] REST-API error (POST https://openqa.opensuse.org/api/v1/jobs/4193752/status): 400 response: Got status update for job 4193752 with unexpected worker ID 1267 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0) 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Isotovideo exit status: 0 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] +++ worker notes +++ 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] End time: 2024-05-16 08:06:49 
 May 16 08:06:49 openqaworker-arm21 worker[18452]: [info] Result: api-failure 
 ``` 

 Restarting the openqa-worker-instances registered the instances correctly against o3 and they seem to be capable of completing jobs again. I haven't tried a reboot yet. 

 ## Acceptance criteria 
 * **AC1:** o3 worker hosts consistently register with a proper hostname and not `localhost` without needing to hardcode the hostname in workers.ini 

 ## Suggestions 
 * Check if there was a significant change in openQA which could have caused this 
 * Use reboots to validate proper registered hostnames for workers on o3 
 * Research older tickets to see whether we already had such problems in the past and what we did to resolve them 
 * We should ensure in openQA worker registration code that a "proper hostname" is used for registration while still allowing for "localhost" to register, e.g. for a single instance container setup 
 * Remove hardcoded FQDNs from workers.ini again after the problem is fixed 
 * Ensure that the workers register with proper FQDN again over multiple reboots
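
 The suggestion about rejecting a placeholder hostname while still allowing an explicit "localhost" setup could be sketched roughly as follows. This is only an illustration in Python, not openQA code: the helper names `is_placeholder` and `registration_hostname`, the `allow_localhost` switch and the injectable `lookup` parameter are all assumptions made for this example.

 ```python
 import socket

 # Names that suggest the system hostname was not (yet) configured.
 PLACEHOLDER_HOSTNAMES = {"", "localhost", "localhost.localdomain"}


 def is_placeholder(hostname: str) -> bool:
     """Return True if the name is empty or a localhost-style placeholder."""
     return hostname.strip().lower().rstrip(".") in PLACEHOLDER_HOSTNAMES


 def registration_hostname(allow_localhost: bool = False, lookup=socket.getfqdn) -> str:
     """Pick the hostname to register with (hypothetical helper, not openQA code).

     Raises RuntimeError for a placeholder name unless allow_localhost is set,
     e.g. for a single instance container setup.
     """
     hostname = lookup()
     if is_placeholder(hostname) and not allow_localhost:
         raise RuntimeError(
             f"refusing to register as {hostname!r}; hostname not configured yet?"
         )
     return hostname
 ```

 With a check like this, a worker that starts before its hostname is set would fail early (and could retry) instead of registering as "localhost", while a deliberate single-host setup could opt in via the hypothetical `allow_localhost` switch.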
