action #97574

deployment failed in gitlab job with "ERROR: Job failed: execution took longer than 1h0m0s seconds"

Added by okurz 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2021-08-27
Due date:
2021-09-11
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/osd-deployment/-/jobs/545558

    (15/16) Installing: openQA-client-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [.................done]
    (16/16) Installing: openQA-worker-4.6.1629997637.5c3f9e2dd-lp152.4317.1.noarch [..................done]
    Executing %posttrans script 'dpdk-kmp-default-19.11.4_k5.3.18_lp152.69-lp152.2.12.1.aarch64.rpm' [.....done]
    There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

    Since the last system boot core libraries or services have been updated.
    Reboot is suggested to ensure that your system benefits from these updates.
    […]
    ERROR: Job failed: execution took longer than 1h0m0s seconds

coolo pointed out a failed job
https://openqa.suse.de/tests/6954971
which shows

    Updating files: 100% (3779/3779)
    Updating files: 100% (3779/3779), done.

    encountered object '/var/lib/openqa/pool/2/openqa-tests', but neither allow_blessed, convert_blessed nor allow_tags settings are enabled (or TO_JSON/FREEZE method missing) at /usr/lib/os-autoinst/bmwqemu.pm line 119.
    12014: EXIT 1
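The error comes from JSON-serializing a blessed path object that ended up in the test variables; Perl's JSON encoder refuses blessed references unless allow_blessed/convert_blessed or a TO_JSON method is provided, as the message itself says. The same failure mode has a direct analog in other languages; a minimal Python sketch (variable names are made up):

```python
import json
from pathlib import Path

# A path *object* sneaks into the vars hash instead of a plain string:
test_vars = {"CASEDIR": Path("/var/lib/openqa/pool/2/openqa-tests")}

try:
    json.dumps(test_vars)
except TypeError as e:
    # json, like Perl's JSON encoder on a blessed ref, refuses the object
    print("serialization failed:", e)

# Storing the plain string instead avoids the problem:
test_vars["CASEDIR"] = str(test_vars["CASEDIR"])
print(json.dumps(test_vars))
```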

I triggered a rollback, which then failed:

    (6/6) Installing: openQA-worker-4.6.1629814422.9faec0365-lp152.4304.1.noarch [..................done]
    There are running programs which still use files and libraries deleted or updated by recent upgrades. They should be restarted to benefit from the latest updates. Run 'zypper ps -s' to list these programs.

    Since the last system boot core libraries or services have been updated.
    Reboot is suggested to ensure that your system benefits from these updates.
    ERROR: Minions returned with non-zero exit code
    .
    +++ kill %1
    Cleaning up file based variables 00:00
    ERROR: Job failed: command terminated with exit code 1

Open points

Rollback steps

History

#1 Updated by okurz 3 months ago

  • Description updated (diff)

#2 Updated by okurz 3 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

Retriggered the one mentioned job as https://openqa.suse.de/tests/6955090

#3 Updated by okurz 3 months ago

Sudden rise in incompletes on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=17&orgId=1&from=1630028028398&to=1630050969088, so I triggered `env host=openqa.suse.de openqa-advanced-retrigger-jobs`

#4 Updated by okurz 3 months ago

  • Description updated (diff)

Disabled OSD deployment until this is ruled out: https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit

#5 Updated by okurz 3 months ago

Mainly tinita and mkittler identified https://github.com/os-autoinst/os-autoinst/pull/1748 as the culprit, and together we have prepared a fix: https://github.com/os-autoinst/os-autoinst/pull/1757

Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check types of return values, and I wonder if we should at all in Perl.

#6 Updated by okurz 3 months ago

  • Priority changed from Urgent to High

Paused the deployment; the os-autoinst PR is merged, so we can trigger the deployment later today, e.g. in the evening.

#7 Updated by tinita 3 months ago

okurz wrote:

Checking test coverage from the culprit PR https://app.codecov.io/gh/os-autoinst/os-autoinst/compare/1748/diff#diff-Ym13cWVtdS5wbQ== shows that we do have good statement coverage in the changed areas. We do not check types of return values, and I wonder if we should at all in Perl.

Remember that codecov only shows line coverage, while Devel::Cover is able to do statement and branch coverage. We should at least upload the Devel::Cover report as an artifact (as we do already for openQA in CircleCI).

Since our tests did not fail, it's possible that the case where the variable in question is a file object is not covered by our tests.
And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.

So in this case we should check if the scenario happens in one of our tests.
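The distinction between line and branch coverage matters here because a test suite can execute every line without exercising every branch. A hypothetical Python illustration (not taken from the os-autoinst code):

```python
def is_valid(a, b):
    # A single test with a=True executes this line, so line coverage
    # reports 100% for the function -- but the branch where `a` is
    # False and `b` decides the result is never taken. Branch coverage
    # (what Devel::Cover can report) would flag that gap.
    return a or b

# Line coverage is satisfied by this one call; the `b` branch is not:
assert is_valid(True, False)
```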

#8 Updated by okurz 3 months ago

  • Priority changed from High to Urgent

tinita wrote:

And it's possible to have 100% code coverage and still have the problem that certain scenarios are not covered.

of course

So in this case we should check if the scenario happens in one of our tests.

Well, we have the full-stack test and a git clone test. I guess only a test scenario combining both could trigger the problem where a path object ends up in %bmwqemu::vars and we try to serialize it when saving the variables.
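The kind of regression test being suggested can be sketched in Python with hypothetical names (the real coverage was added to the os-autoinst tests, not this code):

```python
import json
from pathlib import Path

def save_vars(test_vars):
    # Stand-in for the code path that serializes %bmwqemu::vars to disk.
    return json.dumps(test_vars)

def test_vars_with_path_object():
    # Reproduce the incident: a path object (as the git-clone code
    # left behind for the pool directory) makes serialization blow up.
    test_vars = {"CASEDIR": Path("/var/lib/openqa/pool/2/openqa-tests")}
    try:
        save_vars(test_vars)
    except TypeError:
        return True  # failure mode reproduced
    raise AssertionError("expected serialization to fail")

print(test_vars_with_path_object())
```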

#9 Updated by okurz 3 months ago

  • Description updated (diff)
  • Priority changed from Urgent to High

The fix was merged and packages have been built. I triggered a new deployment pipeline: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/192833

Noted down "open points" to discuss as follow-ups

#10 Updated by cdywan 3 months ago

I found a way to hit the problem in the git test: https://github.com/os-autoinst/os-autoinst/pull/1758

#11 Updated by openqa_review 3 months ago

  • Due date set to 2021-09-11

Setting due date based on mean cycle time of SUSE QE Tools

#12 Updated by okurz 3 months ago

merged https://github.com/os-autoinst/os-autoinst/pull/1758 , thanks cdywan.

Triggered a new deployment now: https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/194658 , currently running

#13 Updated by okurz 3 months ago

  • Description updated (diff)

openqaworker-arm-2 seems to show problems in the deployment and appears non-responsive. I removed it from salt and retriggered the deployment.

#14 Updated by okurz 3 months ago

  • Description updated (diff)

The deployment step passed, but post-monitor failed because openqaworker-arm-2 was not reporting as online. Fixed that manually and retriggered the pipeline step, which succeeded. Re-added openqaworker-arm-2 to salt and ran zypper dup to bring it to the same state as the others. Increased the CI job timeout in https://gitlab.suse.de/openqa/osd-deployment/-/settings/ci_cd#js-general-pipeline-settings from 1h to 2h.
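Besides the project-wide setting changed above, GitLab also supports a per-job timeout via the `timeout:` keyword in `.gitlab-ci.yml`; a sketch (job name and script are hypothetical, only the keyword is GitLab's):

```yaml
deploy:
  # Job-level timeout overrides the project default (still capped by
  # any runner-level maximum); values like "2h" or "90 minutes" work.
  timeout: 2h
  script:
    - ./deploy.sh   # hypothetical deployment script
```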

#15 Updated by okurz 3 months ago

  • Status changed from In Progress to Feedback

And finally created https://gitlab.suse.de/openqa/osd-deployment/-/merge_requests/33 to ignore individual salt nodes that do not respond in time within the actual deployment step, since at that point we can't go back anyway.

#16 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
