action #169204

closed

[sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running size:S

Added by tinita about 2 months ago. Updated 23 days ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2024-11-01
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_multimachine@64bit-4G fails in test_running

The test only shows up as running after 29 retries, so we never see it as done before the 30 retries are exceeded.
Looking at previous tests, the oldest being 21 days ago, it always seems to take quite a while until the test starts running. It looks like it got a little bit worse in the last 24 hours, judging by the overall runtime of the test.
Possibly a slow asset download?

Reproducible

Fails since (at least) Build :TW.32460 (current job)

Expected result

Last good: :TW.32459 (or more recent)

Further details

Always latest result in this scenario: latest

Suggestions

  • Disable the git_auto_update feature in this test
  • Look into how git_auto_update failures are retried and limited in time

Related issues (2 open, 0 closed)

Related to openQA Project (public) - action #169249: [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S (In Progress, ybonatakis, 2024-11-04 to 2024-12-31)

Related to openQA Project (public) - action #168379: Enable automatic openQA git clone by default size:S (Blocked, mkittler, 2024-10-17)

Actions #1

Updated by okurz about 2 months ago

  • Priority changed from Normal to High
Actions #2

Updated by tinita about 2 months ago

  • Subject changed from [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running to [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running
Actions #3

Updated by tinita about 2 months ago

  • Related to action #169249: [sporadic] openqa_install_multimachine test fails in test_running - ping test fails auto_review:"Test died: command[\s\S]*openqa-cli api jobs" size:S added
Actions #4

Updated by tinita about 1 month ago · Edited

Looking at the test runtimes:
https://openqa.opensuse.org/tests/4612989#next_previous
it seems that we are back to the numbers from a day before this ticket was created. I wonder if this could be related to the git_auto_clone feature that was enabled and now disabled again. Maybe this resulted in git_clone minion jobs that weren't there before.
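
For cross-checking, this is a hypothetical way to list such git_clone Minion jobs on the nested instance, assuming openQA's app script exposes the standard Minion job command (the path, user and filters below are illustrative, not taken from this ticket):

# List recent Minion jobs of the git_clone task on the openQA host under test
# (run as a user with access to openQA's database, e.g. geekotest):
sudo -u geekotest /usr/share/openqa/script/openqa minion job -t git_clone

# Only the failed ones:
sudo -u geekotest /usr/share/openqa/script/openqa minion job -t git_clone -S failed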
I see several error messages:
https://openqa.opensuse.org/tests/4610557/logfile?filename=test_running-journal.log.txt&filter=gru#line-6018

Nov 01 08:08:55 susetest openqa-gru[124548]: [error] Git command failed: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin master - Error: error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
Nov 01 08:08:55 susetest openqa-gru[124548]: error: 974 bytes of body are still expected
Nov 01 08:08:55 susetest openqa-gru[124548]: fetch-pack: unexpected disconnect while reading sideband packet
Nov 01 08:08:55 susetest openqa-gru[124548]: fatal: early EOF
Nov 01 08:08:55 susetest openqa-gru[124548]: fatal: fetch-pack: invalid index-pack output

And later:
https://openqa.opensuse.org/tests/4610557/logfile?filename=test_running-journal.log.txt&filter=gru#line-6051

Nov 01 08:11:36 susetest openqa-gru[124693]: [info] Running cmd: env GIT_SSH_COMMAND=ssh -oBatchMode=yes git -C /var/lib/openqa/share/tests/opensuse fetch origin master
Nov 01 08:21:08 susetest openqa-gru[124693]: [info] cmd returned 0

Note the duration of almost 10 minutes (08:11:36 to 08:21:08).
That explains the long overall runtime; in this case it seems to have been a slow network problem.

However, we should check whether we can disable git_auto_update for this test to avoid the problem. We didn't need it before.

Currently the default is back to false, but we will enable it again soon.
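
A minimal sketch of what disabling it for this test could look like, assuming the flag is read from an [scm git] section of /etc/openqa/openqa.ini on the nested instance (section name, key spelling and value format are assumptions to be checked against the openQA documentation, not a verified recipe):

# Hypothetical test-setup snippet: turn the feature off on the nested openQA
# instance before any jobs are scheduled. Section/key names are assumptions.
sudo tee -a /etc/openqa/openqa.ini > /dev/null <<'EOF'
[scm git]
git_auto_update = no
EOF
sudo systemctl restart openqa-webui openqa-gru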

Actions #5

Updated by mkittler about 1 month ago

  • Subject changed from [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running to [sporadic] [openqa-in-openqa] openqa_install_multimachine test fails in test_running - taking too long until test is running size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #7

Updated by mkittler about 1 month ago · Edited

There are no more failures of that kind, which makes sense because I disabled git_auto_update by default again.

It looks like the Git commands running into problems and being retried was not a problem in itself, because both jobs were eventually started:

Nov 01 08:23:09 susetest worker[92165]: [info] +++ setup notes +++
Nov 01 08:23:09 susetest worker[92165]: [info] Running on susetest:1 (Linux 6.11.5-1-default #1 SMP PREEMPT_DYNAMIC Wed Oct 23 04:27:11 UTC 2024 (b4e3aa9) x86_64)
Nov 01 08:23:09 susetest worker[92165]: [info] Preparing cgroup to start isotovideo
Nov 01 08:23:09 susetest worker[92165]: [warn] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.40.0/Mojo/File.pm line 85.
Nov 01 08:23:09 susetest worker[92165]: [info] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
Nov 01 08:23:09 susetest worker[92165]: [info] Starting isotovideo container
Nov 01 08:23:09 susetest worker[92165]: [info] isotovideo has been started (PID: 125040)
Nov 01 08:23:09 susetest worker[125040]: [info] 125040: WORKING 1
Nov 01 08:23:09 susetest worker[92219]: [info] +++ setup notes +++
Nov 01 08:23:09 susetest worker[92219]: [info] Running on susetest:2 (Linux 6.11.5-1-default #1 SMP PREEMPT_DYNAMIC Wed Oct 23 04:27:11 UTC 2024 (b4e3aa9) x86_64)
Nov 01 08:23:09 susetest worker[92219]: [info] Preparing cgroup to start isotovideo
Nov 01 08:23:09 susetest worker[92219]: [warn] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.40.0/Mojo/File.pm line 85.
Nov 01 08:23:09 susetest worker[92219]: [info] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
Nov 01 08:23:09 susetest worker[92219]: [info] Starting isotovideo container
Nov 01 08:23:09 susetest worker[92219]: [info] isotovideo has been started (PID: 125042)
Nov 01 08:23:09 susetest worker[125042]: [info] 125042: WORKING 2

Those jobs were terminated prematurely, though. No logs were uploaded, just a few json/png files for the first few modules (boot_to_desktop, bootloader_start, setup_multimachine). Considering that both tests reached setup_multimachine¹, I don't think there was a fundamental problem like the test repo or assets missing. Because no logs have been uploaded, I don't know why the test execution was terminated. Neither the outer nor the nested jobs reached the max. job time.

So it is hard to tell why git_auto_update caused this failure. The only indication is the job runtime (33:36 minutes), which is higher than before (< 21 minutes) but doesn't exceed the max. job time.


¹ I'm not sure whether both tests reached that module because the uploaded logs seem incomplete. Maybe only one reached it. However, https://openqa.opensuse.org/tests/4610557/logfile?filename=test_running-running_processes.txt definitely shows that both tests were started, including VMs and video encoder.

Actions #8

Updated by tinita about 1 month ago

I think the problem is not that the job itself exceeded a runtime limit, but that scheduling and waiting for the job took all 30 retries; we can see from the screenshot that the job only started to run at the 30th retry (you can see "state": "scheduled" and then "state": "running"). So the check never saw it finish and therefore failed, and the system was probably shut down before the job was able to finish.
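
In other words, the check polls the job state a fixed number of times and gives up if the job never reaches a final state. A rough conceptual sketch of that kind of polling loop, with placeholder job ID, host and sleep interval (the real test wraps this differently via its retry helper):

# Poll the nested openQA instance until the job is done, giving up after
# 30 attempts. Job ID, host and retry interval are placeholders.
job_id=1
for attempt in $(seq 1 30); do
    state=$(openqa-cli api --host http://localhost jobs/"$job_id" | jq -r '.job.state')
    echo "attempt $attempt: state=$state"
    [ "$state" = "done" ] && break
    sleep 10
done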

Actions #9

Updated by tinita about 1 month ago

And since we never needed git_auto_update for this test before, we should disable it; apparently the test already makes sure the git repo is cloned (https://openqa.opensuse.org/tests/4610557#step/test_distribution/1) via the initial cloning in the fetchneedles script.
Alternatively, we could remove the fetchneedles call and rely on git_auto_update. But we should only do the cloning once to avoid such problems.

Since this specific failure was only seen once, I think that's good enough. We have to live with sporadic network problems.

Actions #10

Updated by tinita about 1 month ago · Edited

What I don't understand yet is why the test runtime increased (by about 2 minutes) during the phase where git_auto_update was enabled by default. If the repo is already cloned with fetchneedles, another git fetch should be fast.

edit:
https://openqa.opensuse.org/tests/4610689/logfile?filename=test_running-openqa_services.log.txt#line-56 shows that the whole series of git commands took over 3.5 minutes in this passing case...
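
A quick, hypothetical way to eyeball those durations from the linked service log, assuming the same "[info] Running cmd: ..." / "[info] cmd returned ..." pattern as in the journal excerpt above (the file name is taken from the linked URL):

# Print the start/end lines of each command run by openqa-gru so the
# per-command duration can be read off the timestamps:
grep -E 'Running cmd:|cmd returned' test_running-openqa_services.log.txt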

Actions #11

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback

Right, the retry for checking the job status simply timed out, and that can of course happen without any of the openQA jobs timing out. I suppose disabling git_auto_update would make sense for these kinds of tests as it might just take too long.

However, I guess it is actually good that we patiently retry in general because in production it avoids incompletes. The only other option would be to allow running the openQA job even if the automatic update failed. This would actually also make some kind of sense: we probably don't want GitHub to be a single point of failure for our job execution. Before (with fetchneedles) this was good enough as well.

So I can either:

  1. Change tests to disable git_auto_update even though we want to proceed with our plans of enabling it by default. It would be a bit annoying to do this in the test case where we use the container. And for the case where we use the bootstrap script we should probably add an option to that script instead of adding too much logic to the test code.
  2. Make failures to update Git repos non-fatal. Maybe we don't have to change the number of retries for Git updates. However, after e.g. the 2nd failure we would simply delete the gru dependency anyway and unblock the openQA jobs.
Actions #12

Updated by mkittler about 1 month ago

PR for option 2: https://github.com/os-autoinst/openQA/pull/6055

A 3rd option would be to increase the number of retries in the openQA test and the timeout for running retry …. However, it is probably not great to have these tests running for even longer if we can cut it short. (With the current number of retries we would spend up to 6.6 minutes on waiting/delaying alone. There are also further timeouts involved when running Git commands that would factor in for every retry.)

Actions #13

Updated by livdywan 26 days ago

mkittler wrote in #note-12:

PR for option 2: https://github.com/os-autoinst/openQA/pull/6055

Option 2 was merged.

Actions #14

Updated by mkittler 23 days ago

  • Status changed from Feedback to Resolved

We have disabled git_auto_update by default again due to other problems with it. I think my PR will help once we enable git_auto_update again, so I consider this ticket resolved. If this is still problematic once git_auto_update is enabled by default again, we can reopen it.

Actions #15

Updated by livdywan 23 days ago

  • Related to action #168379: Enable automatic openQA git clone by default size:S added