action #97574
Updated by okurz about 3 years ago
## Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558
```
(15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done]
(16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done]
Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
.
.
.
…
.
ERROR: Job failed: execution took longer than 1h0m0s seconds
```
coolo pointed out a failed job
https://openqa.suse.de/tests/6954971
which shows
```
Updating files: 100% (3779/3779)
Updating files: 100% (3779/3779), done.
encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119.
12014: EXIT 1
```
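The message above is what a JSON encoder prints when it is handed an object instead of a plain value; here the object stringifies to a path. As a rough analogy only (os-autoinst is Perl, and this is not its code), the same failure class can be reproduced in Python with a path object. The sketch also shows the kind of check a test could run over job settings. The key `TEST_CHECKOUT_DIR` and both helper names are hypothetical.

```python
import json
from pathlib import PurePath, PurePosixPath


def find_unserializable(settings):
    """Return the sorted keys whose values the default JSON encoder rejects."""
    bad = []
    for key, value in settings.items():
        try:
            json.dumps(value)
        except TypeError:
            # Raised for objects the encoder has no rule for, e.g. path objects.
            bad.append(key)
    return sorted(bad)


def normalize(settings):
    """Return a copy in which path objects are replaced by plain strings."""
    return {
        key: str(value) if isinstance(value, PurePath) else value
        for key, value in settings.items()
    }


# Hypothetical settings: one value is a path object rather than a string.
settings = {
    "TEST_CHECKOUT_DIR": PurePosixPath("/var/lib/openqa/pool/2/openqa-tests"),
    "BACKEND": "qemu",
}

print(find_unserializable(settings))
print(find_unserializable(normalize(settings)))
```

A test along these lines would fail as soon as a setting holds an object, before the value reaches the code that writes it out as JSON.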
I triggered a rollback, which then failed:
```
(6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
ERROR: Minions returned with non-zero exit code
.
+++ kill %1
Cleaning up file based variables 00:00
ERROR: Job failed: command terminated with exit code 1
```
## Open points
* How can we prevent similar problems with better tests in os-autoinst?
* The deployment failed and "alarmed" us, but we did not know that jobs were impacted -> split the deployment into webUI and workers, which can be offline more often?
* https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=query&editPanel=17&orgId=1&from=1630041569017&to=1630065211489 shows a symptom of the affected jobs, but our alert level is too low -> lower the threshold or add a different alert on the incomplete (no restarts) rate?
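
The last point, an alert on the rate of incomplete jobs, could be sketched as follows. This is only an illustration of the idea, not the existing Grafana alert: the function names, the `threshold` of 0.2 and the `min_jobs` guard are assumptions chosen for the example.

```python
from collections import Counter


def incomplete_ratio(results):
    """Fraction of job results that are 'incomplete' (0.0 for an empty list)."""
    counts = Counter(results)
    total = sum(counts.values())
    return counts["incomplete"] / total if total else 0.0


def should_alert(results, threshold=0.2, min_jobs=10):
    """Alert when enough jobs finished and too many of them are incomplete.

    threshold and min_jobs are assumed values, not the production settings.
    """
    return len(results) >= min_jobs and incomplete_ratio(results) > threshold


# Hypothetical window of recent job results.
window = ["incomplete"] * 5 + ["passed"] * 5
print(incomplete_ratio(window), should_alert(window))
```

The `min_jobs` guard keeps a handful of incompletes during a quiet period from triggering the alert on its own.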
## Rollback steps
* Re-enable OSD deployment in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit