action #97574
closed
deployment failed in gitlab job with "ERROR: Job failed: execution took longer than 1h0m0s seconds"
Description
Observation
https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558
(15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done]
(16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done]
Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
.
.
.
…
.
ERROR: Job failed: execution took longer than 1h0m0s seconds
coolo pointed out a failed job
https://openqa.suse.de/tests/6954971
which shows
Updating files: 100% (3779/3779)
Updating files: 100% (3779/3779), done.
encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119.
12014: EXIT 1
I triggered a rollback, which then failed:
(6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done]
There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.
Since the last system boot core libraries or services have been updated.
Reboot is suggested to ensure that your system benefits from these updates.
ERROR: Minions returned with non-zero exit code
.
+++ kill %1
Cleaning up file based variables 00:00
ERROR: Job failed: command terminated with exit code 1
Open points
- How can we prevent similar problems with better tests in os-autoinst?
- The deployment failed and "alarmed" us, but we did not know that jobs were impacted -> split the deployment into webUI and workers, since the workers can be offline more often?
- https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?tab=query&editPanel=17&orgId=1&from=1630041569017&to=1630065211489 shows a symptom of affected jobs but our alert level is too low -> lower threshold or different alert on incomplete (no restarts) rate?
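The last open point proposes alerting on the rate of incomplete jobs rather than on an absolute number. A minimal sketch of that idea in Python follows. The function names, the list of result strings and the 20% threshold are all assumptions for illustration; none of them come from the actual monitoring setup.

```python
from collections import Counter


def incomplete_rate(results):
    """Return the fraction of finished jobs whose result is 'incomplete'."""
    if not results:
        return 0.0
    return Counter(results)["incomplete"] / len(results)


def should_alert(results, threshold=0.2):
    """Alert when the incomplete rate exceeds the (hypothetical) threshold."""
    return incomplete_rate(results) > threshold


# Hypothetical sample: 3 of the 10 most recent jobs are incomplete.
recent = ["passed"] * 6 + ["failed"] + ["incomplete"] * 3
print(should_alert(recent))
```

A rate-based check like this reacts to a burst of incompletes even when the total job count is low, which an absolute threshold would miss.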
Rollback steps
- DONE: Re-enable OSD deployment in https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
- DONE: Add back openqaworker-arm-2
Updated by okurz over 3 years ago
- Status changed from New to In Progress
- Assignee set to okurz
Retriggered the one mentioned job as https://openqa.suse.de/tests/6955090
Updated by okurz over 3 years ago
There is a sudden rise in incompletes on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1&from=1630028028398&to=1630050969088 so I triggered: env host=openqa.suse.de openqa-advanced-retrigger-jobs
Updated by okurz over 3 years ago
- Description updated (diff)
Disabled OSD deployment until this is ruled out: https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit
Updated by okurz over 3 years ago
Mainly tinita and mkittler have identified https://github.com/os-autoinst/os-autoinst/pull/1748 as the culprit and together we have prepared a change https://github.com/os-autoinst/os-autoinst/pull/1757
Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check the types of return values, and I wonder if we should at all in Perl.
Updated by okurz over 3 years ago
- Priority changed from Urgent to High
Paused the deployment. The os-autoinst PR is merged, so we can trigger the deployment later today, e.g. in the evening.
Updated by tinita over 3 years ago
okurz wrote:
Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check types of return values and I wonder if we should at all in perl.
Remember that codecov only shows line coverage, while Devel::Cover is able to do statement and branch coverage. We should at least upload the Devel::Cover report as an artifact (as we do already for openQA in CircleCI).
Since our tests did not fail, it's possible that the case where the variable in question is a file object is not covered by our tests.
And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.
So in this case we should check if the scenario happens in one of our tests.
Updated by okurz over 3 years ago
- Priority changed from High to Urgent
tinita wrote:
And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.
of course
So in this case we should check if the scenario happens in one of our tests.
Well, we have the full stack test and a git clone test. I guess only a test scenario that combines both could trigger the problem: a path object ends up in %bmwqemu::vars and we then try to serialize it when saving the variables.
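As a rough analogue in Python (not the actual os-autoinst change, which is Perl and lives in the linked PRs), the sketch below shows one way to keep a path object from breaking serialization: values the encoder does not know are converted to strings. The function name and the keys EXAMPLE_DIR and EXAMPLE_FLAG are hypothetical; the path is the one from the error message above.

```python
import json
from pathlib import PurePosixPath


def save_vars(variables):
    # default=str is only called for values json cannot encode natively,
    # so path objects are stored as plain strings.
    return json.dumps(variables, default=str, sort_keys=True)


# Hypothetical variables: one value set by a clone step, one plain setting.
variables = {
    "EXAMPLE_DIR": PurePosixPath("/var/lib/openqa/pool/2/openqa-tests"),
    "EXAMPLE_FLAG": 1,
}
saved = save_vars(variables)
print(json.loads(saved)["EXAMPLE_DIR"])
```

Whether to convert at serialization time, as here, or to store only strings in the variables in the first place is a design choice; the second option keeps the stored data uniform for every reader.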
Updated by okurz over 3 years ago
- Description updated (diff)
- Priority changed from Urgent to High
The fix was merged and packages have been built. I triggered a new deployment pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/192833
Noted down "open points" to discuss as follow-ups
Updated by livdywan over 3 years ago
I found a way to hit the problem in the git test: https://github.com/os-autoinst/os-autoinst/pull/1758
Updated by openqa_review over 3 years ago
- Due date set to 2021-09-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 3 years ago
merged https://github.com/os-autoinst/os-autoinst/pull/1758 , thanks cdywan.
Triggered a new deployment now: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/194658 , currently running
Updated by okurz over 3 years ago
- Description updated (diff)
openqaworker-arm-2 seems to show problems in deployment and appears non-responsive. I removed it from salt and retriggered the deployment.
Updated by okurz over 3 years ago
- Description updated (diff)
The deployment step passed but post-monitor failed because openqaworker-arm-2 was not reporting as online. I fixed that manually and retriggered the pipeline step, which succeeded. I re-added openqaworker-arm-2 to salt and ran zypper dup to bring it up to the same state as the others. I also increased the CI job timeout in https://gitlab.suse.de/openqa/osd-deployment/-/settings/ci_cd#js-general-pipeline-settings from 1h to 2h.
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
And finally created https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/33 to ignore individual salt nodes that do not respond in time during the actual deployment step, at which point we can't go back.
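The idea of tolerating unresponsive nodes while still failing on real errors can be sketched as follows. This is an illustration in Python under stated assumptions, not what the merge request implements: the function name, the result format (None for "no response in time", otherwise an exit code) and the node names other than openqaworker-arm-2 are hypothetical.

```python
def evaluate_deployment(results):
    """Classify per-node results.

    None means the node did not respond in time; any other value is the
    exit code the node returned.
    """
    unresponsive = sorted(n for n, rc in results.items() if rc is None)
    failed = sorted(n for n, rc in results.items() if rc not in (None, 0))
    # Only real failures abort the step; unresponsive nodes are reported
    # so they can be brought up to date later.
    return {"ok": not failed, "failed": failed, "unresponsive": unresponsive}


# Hypothetical results; only openqaworker-arm-2 is named in this ticket.
results = {"worker-a": 0, "worker-b": 0, "openqaworker-arm-2": None}
print(evaluate_deployment(results))
```

Reporting the unresponsive nodes separately keeps them visible for a later manual update, as was done here for openqaworker-arm-2.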
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved