action #97574

Updated by okurz over 2 years ago

## Observation 

 https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558 

 ``` 
     (15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done] 
     (16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done] 
     Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done] 
     There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs. 
     
     Since the last system boot core libraries or services have been updated. 
     Reboot is suggested to ensure that your system benefits from these updates. 
 . 
 . 
 . 
 … 
 . 
 ERROR: Job failed: execution took longer than 1h0m0s seconds 
 ``` 

 coolo pointed out a failed job, https://openqa.suse.de/tests/6954971, which shows:

 ``` 
 Updating files: 100% (3779/3779) 
 Updating files: 100% (3779/3779), done. 
  
 encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119. 
 12014: EXIT 1 
 ``` 
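
 The message above is the kind of error a JSON encoder raises when it is handed an object instead of a plain string; here the object stringifies to a path. As a rough analogue only (Python instead of the Perl used in os-autoinst, with a hypothetical `save_vars` helper and an assumed `CASEDIR` key), the failure mode and a regression check for it could look like this:

```python
import json
from pathlib import PurePosixPath


def save_vars(settings):
    """Serialize settings to JSON, converting path objects to plain strings first."""
    plain = {
        key: str(value) if isinstance(value, PurePosixPath) else value
        for key, value in settings.items()
    }
    return json.dumps(plain, sort_keys=True)


# Hypothetical settings: the key name is an assumption, the path is taken from the log.
settings = {"CASEDIR": PurePosixPath("/var/lib/openqa/pool/2/openqa-tests")}

# Passing the raw object to the encoder fails, similar to the error in the log.
try:
    json.dumps(settings)
    raw_failed = False
except TypeError:
    raw_failed = True

# Stringifying the path before encoding avoids the failure.
encoded = save_vars(settings)
```

 A test that saves settings containing a path object, rather than only plain strings, would likely have caught this class of problem before deployment.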

 I triggered a rollback, which then failed:

 ``` 
     (6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done] 
     There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs. 
     
     Since the last system boot core libraries or services have been updated. 
     Reboot is suggested to ensure that your system benefits from these updates. 
 ERROR: Minions returned with non-zero exit code 
 . 
 +++ kill %1 
 Cleaning up file based variables 00:00 
 ERROR: Job failed: command terminated with exit code 1 
 ``` 

 ## Open points 
 * How can we prevent similar problems with better tests in os-autoinst?
 * The deployment failed and "alarmed" us, but we did not know that jobs were impacted -> split the deployment into webUI and workers, since workers can more often be offline?
 * https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=query&editPanel=17&orgId=1&from=1630041569017&to=1630065211489 shows a symptom of affected jobs, but our alert is not sensitive enough to catch it -> lower the threshold, or add a different alert on the rate of incomplete (no restarts) jobs?
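
 The last point, an alert on the rate of incomplete jobs, could be sketched as follows. This is only an illustration: the function names, the input format (a list of job result strings) and the 20% threshold are assumptions, not what the monitoring setup actually uses.

```python
def incomplete_rate(results):
    """Return the fraction of finished jobs whose result is 'incomplete'."""
    if not results:
        return 0.0
    incomplete = sum(1 for result in results if result == "incomplete")
    return incomplete / len(results)


def should_alert(results, threshold=0.2):
    """Return True if the incomplete rate exceeds the (assumed) threshold."""
    return incomplete_rate(results) > threshold
```

 For example, `should_alert(["passed", "incomplete", "incomplete", "failed"])` reports an alert because half of the jobs are incomplete, while an empty time window never alerts.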

 ## Rollback steps 

 * Re-enable OSD deployment in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit 
 * Add back openqaworker-arm-2
