action #97574
Updated by okurz about 3 years ago
## Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558

```
(15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done]
(16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done]
Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
. . . … .
ERROR: Job failed: execution took longer than 1h0m0s seconds
```

coolo pointed out a failed job https://openqa.suse.de/tests/6954971 which shows

```
Updating files: 100% (3779/3779)
Updating files: 100% (3779/3779), done.
encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119.
12014: EXIT 1
```

I triggered rollback which then failed:

```
(6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
ERROR: Minions returned with non-zero exit code
.
+++ kill %1
Cleaning up file based variables 00:00
ERROR: Job failed: command terminated with exit code 1
```

## Open points

* How to prevent similar problems by better tests in os-autoinst
* deployment failed and "alarmed" us but we did not know that jobs were impacted -> split deployment for webUI and workers which can more often be offline?
* https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=query&editPanel=17&orgId=1&from=1630041569017&to=1630065211489 shows a symptom of affected jobs but our alert level is too low -> lower threshold or different alert on incomplete (no restarts) rate?

## Rollback steps

* DONE: Re-enable OSD deployment in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
* DONE: Add back openqaworker-arm-2
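
Regarding the open point about alerting on the incomplete rate: a minimal sketch of what such a check could compute, not the actual monitoring setup. The function names and the 20% threshold are hypothetical, and the input is assumed to be a list of result strings for recently finished jobs with restarted jobs already filtered out.

```python
from collections import Counter


def incomplete_rate(results):
    """Return the fraction of jobs whose result is 'incomplete'.

    `results` is a list of job result strings; an empty list yields 0.0.
    """
    counts = Counter(results)
    total = sum(counts.values())
    return counts["incomplete"] / total if total else 0.0


def should_alert(results, threshold=0.2):
    """Alert when the incomplete rate exceeds the (hypothetical) threshold."""
    return incomplete_rate(results) > threshold


if __name__ == "__main__":
    # Illustrative data only, not taken from the incident.
    recent = ["passed", "incomplete", "failed", "incomplete", "incomplete"]
    print(incomplete_rate(recent))
    print(should_alert(recent))
```

A rate-based check like this would react to a burst of incompletes regardless of the absolute job count, which is the property the existing alert seemed to lack here.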