action #97574

deployment failed in gitlab job with "ERROR: Job failed: execution took longer than 1h0m0s seconds"

Added by okurz 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2021-08-27
Due date:
2021-09-11
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558

    (15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done]
    (16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done]
    Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done]
    There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

    Since the last system boot core libraries or services have been updated.
    Reboot is suggested to ensure that your system benefits from these updates.
    […]
    ERROR: Job failed: execution took longer than 1h0m0s seconds

coolo pointed out a failed job
https://openqa.suse.de/tests/6954971
which shows

    Updating files: 100% (3779/3779)
    Updating files: 100% (3779/3779), done.

    encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119.
    12014: EXIT 1
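The error comes from JSON-serializing a blessed path object that ended up in the test variables; Perl's JSON encoder refuses blessed references unless allow_blessed/convert_blessed or a TO_JSON method is provided, as the message itself says. The same failure mode has a direct analog in other languages; a minimal Python sketch (variable names are made up):

```python
import json
from pathlib import Path

# A path *object* sneaks into the vars hash instead of a plain string:
test_vars = {"CASEDIR": Path("/var/lib/openqa/pool/2/openqa-tests")}

try:
    json.dumps(test_vars)
except TypeError as e:
    # json, like Perl's JSON encoder on a blessed ref, refuses the object
    print("serialization failed:", e)

# Storing the plain string instead avoids the problem:
test_vars["CASEDIR"] = str(test_vars["CASEDIR"])
print(json.dumps(test_vars))
```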

I triggered a rollback, which then failed:

    (6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done]
    There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

    Since the last system boot core libraries or services have been updated.
    Reboot is suggested to ensure that your system benefits from these updates.
    ERROR: Minions returned with non-zero exit code
    .
    +++ kill %1
    Cleaning up file based variables 00:00
    ERROR: Job failed: command terminated with exit code 1

Open points

Rollback steps

History

#1 Updated by okurz 3 months ago

  • Description updated (diff)

#2 Updated by okurz 3 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Retriggered the one mentioned job as https://openqa.suse.de/tests/6955090

#3 Updated by okurz 3 months ago

Sudden rise in incompletes on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1&from=1630028028398&to=1630050969088, so I triggered `env host=openqa.suse.de openqa-advanced-retrigger-jobs`

#4 Updated by okurz 3 months ago

  • Description updated (diff)

Disabled OSD deployment until this is ruled out: https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit

#5 Updated by okurz 3 months ago

Mainly tinita and mkittler identified https://github.com/os-autoinst/os-autoinst/pull/1748 as the culprit, and together we have prepared a fix: https://github.com/os-autoinst/os-autoinst/pull/1757

Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check types of return values, and I wonder if we should at all in Perl.

#6 Updated by okurz 3 months ago

  • Priority changed from Urgent to High

Paused the deployment; the os-autoinst PR is merged, so we can trigger the deployment later today, e.g. in the evening.

#7 Updated by tinita 3 months ago

okurz wrote:

Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check types of return values, and I wonder if we should at all in Perl.

Remember that codecov only shows line coverage, while Devel::Cover is able to do statement and branch coverage. We should at least upload the Devel::Cover report as an artifact (as we do already for openQA in CircleCI).

Since our tests did not fail, it's possible that the case where the variable in question is a file object is not covered by our tests.
And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.

So in this case we should check if the scenario happens in one of our tests.
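The distinction between line and branch coverage matters here because a test suite can execute every line without exercising every branch. A hypothetical Python illustration (not taken from the os-autoinst code):

```python
def is_valid(a, b):
    # A single test with a=True executes this line, so line coverage
    # reports 100% for the function -- but the branch where `a` is
    # False and `b` decides the result is never taken. Branch coverage
    # (what Devel::Cover can report) would flag that gap.
    return a or b

# Line coverage is satisfied by this one call; the `b` branch is not:
assert is_valid(True, False)
```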

#8 Updated by okurz 3 months ago

  • Priority changed from High to Urgent

tinita wrote:

And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.

of course

So in this case we should check if the scenario happens in one of our tests.

Well, we have the full-stack test and a git clone test. I guess only a test scenario combining both could trigger the problem where a path object ends up in %bmwqemu::vars and we try to serialize it when saving the variables.
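The kind of regression test being suggested can be sketched in Python with hypothetical names (the real coverage was added to the os-autoinst tests, not this code):

```python
import json
from pathlib import Path

def save_vars(test_vars):
    # Stand-in for the code path that serializes %bmwqemu::vars to disk.
    return json.dumps(test_vars)

def test_vars_with_path_object():
    # Reproduce the incident: a path object (as the git-clone code
    # left behind for the pool directory) makes serialization blow up.
    test_vars = {"CASEDIR": Path("/var/lib/openqa/pool/2/openqa-tests")}
    try:
        save_vars(test_vars)
    except TypeError:
        return True  # failure mode reproduced
    raise AssertionError("expected serialization to fail")

print(test_vars_with_path_object())
```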

#9 Updated by okurz 3 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

The fix was merged and packages have been built. I triggered a new deployment pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/192833

Noted down "open points" to discuss as follow-ups

#10 Updated by cdywan 3 months ago

I found a way to hit the problem in the git test: https://github.com/os-autoinst/os-autoinst/pull/1758

#11 Updated by openqa_review 3 months ago

  • Due date set to 2021-09-11

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz 3 months ago

merged https://github.com/os-autoinst/os-autoinst/pull/1758 , thanks cdywan.

Triggered a new deployment now: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/194658 , currently running

#13 Updated by okurz 3 months ago

  • Description updated (diff)

openqaworker-arm-2 seems to show problems in the deployment and appears non-responsive. I removed it from salt and retriggered the deployment.

#14 Updated by okurz 3 months ago

  • Description updated (diff)

The deployment step passed, but post-monitor failed because openqaworker-arm-2 was not reporting as online. Fixed that manually and retriggered the pipeline step, which succeeded. Re-added openqaworker-arm-2 to salt and ran zypper dup to bring it to the same state as the others. Increased the CI job timeout in https://gitlab.suse.de/openqa/osd-deployment/-/settings/ci_cd#js-general-pipeline-settings from 1h to 2h.
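Besides the project-wide setting changed above, GitLab also supports a per-job timeout via the `timeout:` keyword in `.gitlab-ci.yml`; a sketch (job name and script are hypothetical, only the keyword is GitLab's):

```yaml
deploy:
  # Job-level timeout overrides the project default (still capped by
  # any runner-level maximum); values like "2h" or "90 minutes" work.
  timeout: 2h
  script:
    - ./deploy.sh   # hypothetical deployment script
```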

#15 Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback

And finally created https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/33 to ignore individual salt nodes that do not respond in time within the actual deployment step, since at that point we can't go back anyway.

#16 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
