Project

General

Profile

Actions

action #125276

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Ensure that the incomplete jobs with "cache service full" are properly restarted size:M

Added by okurz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-03-02
Due date:
% Done:

0%

Estimated time:

Description

Motivation

We have been getting reports of jobs with "cache service full" having to be manually restarted, which is causing quite a bit of work for some people.

Acceptance criteria

  • AC1: Users don't see incomplete jobs with "cache service full" on their build status page, e.g. jobs are automatically restarted

Acceptance tests

  • AC1: Redo queries from #125276#note-3 to see that all "cache service full" jobs have been cloned and none had to be cloned by users

Suggestion

  • Make error of automatic invocation of auto_duplicate somewhere visible, e.g. by doing the restart within a Minion job (which then might fail and has the error message)
  • Fix the actual issue of restarting once it could be pinned down
  • Make sure the new Minion jobs don't fail later in production once this problem has been resolved
Actions #1

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #2

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress
Actions #3

Updated by mkittler over 1 year ago

The auto_clone_regex from our default config should cover this. The default is also used on OSD and o3.

All relevant jobs have been cloned on o3:

openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc;
 id | clone_id 
----+----------
(0 rows)

Of course these jobs might have been restarted by users. There were in fact 5 recent jobs which incompleted with cache failure: Cache service queue already full … and were restarted manually:

openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
   id    | clone_id 
---------+----------
 3150461 |  3151096
 3150447 |  3151068
 3150398 |  3151069
 3150381 |  3151088
 3150362 |  3151055
(5 rows)

So the auto_clone_regex feature doesn't work very well. However, it generally does work because the same query for jobs that have been automatically restarted (no audit event recorded) reveals many more jobs:

openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is null order by id desc;
   id    | clone_id 
---------+----------
 3152598 |  3152600
 3152028 |  3152034
 3151805 |  3152025
 3151800 |  3152032
 3151741 |  3152007
 3151740 |  3152016
 3151733 |  3152035
 3151727 |  3152012
 3151726 |  3152005
 3151716 |  3152033
 3151715 |  3152015
 3151708 |  3152028
 3151697 |  3152021
 3151694 |  3152023
 3151690 |  3152006
 3151689 |  3152026
 3151262 |  3151268
 3151261 |  3151267
 3151254 |  3151269
 3150801 |  3150806
 3150774 |  3150808
 3150768 |  3150812
 3150754 |  3150807
 3150749 |  3150816
 3150744 |  3150746
 3150738 |  3150851
 3150723 |  3150737
 3150710 |  3150768
 3150696 |  3150799
 3150681 |  3150749
 3150676 |  3150710
 3150674 |  3150853
 3150486 |  3150871
 3150478 |  3150872
 3150476 |  3150703
 3150474 |  3150723
 3150467 |  3150676
 3150452 |  3150875
 3150445 |  3150689
 3150444 |  3150730
 3150442 |  3150680
 3150440 |  3150748
 3150414 |  3150747
 3150412 |  3150675
 3150411 |  3150673
 3150408 |  3150774
 3150400 |  3150754
 3150399 |  3150696
 3150395 |  3150785
 3150394 |  3150745
 3150392 |  3150678
 3150389 |  3150732
 3150387 |  3150837
 3150385 |  3150742
 3150383 |  3150744
 3150378 |  3150679
 3150377 |  3150801
 3150375 |  3150771
 3150374 |  3150738
 3150373 |  3150751
 3150372 |  3150681
 3150367 |  3150674
 3150366 |  3150672
 3150365 |  3150803
 3150363 |  3150725
 3150361 |  3150684
 3150359 |  3150740
 3150358 |  3150779
 3150356 |  3150741
 3150354 |  3150677
 3150351 |  3150735
 3150348 |  3150743
 3150051 |  3150081
 3150049 |  3150083
 3150048 |  3150082
 3150041 |  3150084
 3150035 |  3150080
 3150034 |  3150079
 3150033 |  3150078
 3150032 |  3150077
 3150031 |  3150076
 3150007 |  3150074
(82 rows)

Now the question is: Why does the auto_clone_regex feature sometimes work and sometimes doesn't? The incomplete reason of those jobs is really identical.

Actions #4

Updated by mkittler over 1 year ago

The logs for one of the manually restarted jobs looks like this:

martchus@ariel:~> xzgrep 3150461 /var/log/openqa* 
/var/log/openqa:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa:  path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:48:19.678603Z] [debug] [pid:1818] Started to send message to 517 for job(s) 3150461
/var/log/openqa.3.xz:[2023-03-01T08:48:19.775734Z] [debug] [pid:1818] Worker 517 accepted job 3150461
/var/log/openqa.3.xz:  3150461 => {
/var/log/openqa.3.xz:[2023-03-01T08:48:40.460536Z] [debug] [pid:15421] _carry_over_candidate(3150461): _failure_reason=GOOD
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa.3.xz:  path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:49:06.985123Z] [debug] Restarting job 3150461
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa.3.xz:  path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T11:01:11.703255Z] [debug] Restarting job 3150461
/var/log/openqa.3.xz:  3150461 => {
/var/log/openqa.3.xz:[2023-03-01T11:01:12.021846Z] [debug] [pid:6405] Job 3150461 duplicated as 3151096

Not sure whether we'd see errors for a failed automatic auto_duplicate invocation. We definitely see that the job has been restarted via the API (Restarting job is only logged by the API controller, not when restarting automatically).

For another job the restarting via API is logged twice:

martchus@ariel:~> xzgrep 3150447 /var/log/openqa* 
/var/log/openqa.3.xz:[2023-03-01T08:48:03.025112Z] [debug] [pid:1818] Started to send message to 518 for job(s) 3150447
/var/log/openqa.3.xz:[2023-03-01T08:48:03.112224Z] [debug] [pid:1818] Worker 518 accepted job 3150447
/var/log/openqa.3.xz:  3150447 => {
/var/log/openqa.3.xz:[2023-03-01T08:48:34.946664Z] [debug] [pid:17388] _carry_over_candidate(3150447): _failure_reason=GOOD
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit",
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs",
/var/log/openqa.3.xz:  path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:48:52.482642Z] [debug] Restarting job 3150447
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit",
/var/log/openqa.3.xz:  "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs",
/var/log/openqa.3.xz:  path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T10:59:23.167754Z] [debug] Restarting job 3150447
/var/log/openqa.3.xz:  3150447 => {
/var/log/openqa.3.xz:[2023-03-01T10:59:23.494619Z] [debug] [pid:2031] Job 3150447 duplicated as 3151068

But there only one Job … duplicated line logged. Maybe the first manual restart attempt failed so the user tried again. This lets me think that we maybe have a general problem with restarts (independently of whether they're triggered automatically or manually) and that those errors are suppressed.

The code that invoked auto_duplicate in case the auto clone regex matches in fact completely ignores any error returned from that function.

When auto_duplicate is invoked the error returned from that function is returned to the user and not logged. That would explain why the log above doesn't contain any errors for the first Restarting job 3150447. Of course this leaves us with no clue what the error actually was.

Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2023-03-17

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler over 1 year ago

  • Subject changed from Ensure that the incomplete jobs with "cache service full" are properly restarted to Ensure that the incomplete jobs with "cache service full" are properly restarted size:M
  • Description updated (diff)
Actions #7

Updated by mkittler over 1 year ago

  • Description updated (diff)

PR for improving the error handling: https://github.com/os-autoinst/openQA/pull/5025

I'll do the queries again a while after it has been deployed and also check the Minion dashboard.

Actions #8

Updated by mkittler over 1 year ago

  • Description updated (diff)
Actions #9

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

The PR has been merged. Maybe the retry already helps or we at least have a concrete error message now. I'll do the queries again tomorrow evening or on Friday.

Actions #10

Updated by mkittler over 1 year ago

I couldn't find any failing Minion restart_job jobs except for cases when the job has already been restarted anyways. Those shouldn't actually be failures. I've created https://github.com/os-autoinst/openQA/pull/5031 to fix this.

The list of jobs with reason "cache service full" that had to be restarted manually nevertheless grew longer (this is the list of jobs as of the last job considered last time in #125276125300#note-3):

openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc;
 id | clone_id 
----+----------
(0 rows)

openqa=# select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
   id    | clone_id |     t_finished      
---------+----------+---------------------
 3162168 |  3162458 | 2023-03-08 03:32:46
 3161970 |  3162402 | 2023-03-08 03:32:44
 3161969 |  3162457 | 2023-03-08 03:32:47
 3161961 |  3162411 | 2023-03-08 03:32:33
 3161953 |  3162428 | 2023-03-08 03:32:43
 3161912 |  3162412 | 2023-03-08 03:32:27
 3157176 |  3157708 | 2023-03-05 12:17:27
 3155695 |  3156172 | 2023-03-04 07:20:41
(8 rows)

The earliest restart_job job was created on 2023-03-08T17:21:40.522945Z so https://github.com/os-autoinst/openQA/pull/5025 was deployed around that time. (The are much older finished Minion jobs so the automatic cleanup of finished jobs can be ignored here.) That means all further jobs revealed by the query had finished before the PR was deployed and are thus not relevant.

So maybe using a Minion job with retry made it more reliably, indeed. I'll re-check next week.

Note that most of these jobs were finishing around the same time, likely when openqa-webui.service restarted (the journal doesn't go far enough to say for sure). A restart of openqa-webui.service would actually explain why the restarting didn't happen. Supposedly that's something the mentioned PR would indeed improve.

Actions #11

Updated by mkittler over 1 year ago

No further occurrences:

openqa=# select id, clone_id from jobs where (id > 3162168) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
 id | clone_id 
----+----------
(0 rows)

The PR https://github.com/os-autoinst/openQA/pull/5031 is still pending so I've been removing failed restart jobs manually for now.

Actions #12

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

There are no failing restart jobs and no further occurrences:

openqa=# select id, clone_id from jobs where (t_created >= '2023-03-13 10:00:00') and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
   id    | clone_id 
---------+----------
 3172696 |  3172803
(1 row)

The query shows one result but that's only because this job itself is the result of another job being manually restarted.

Actions #13

Updated by okurz over 1 year ago

  • Due date deleted (2023-03-17)
Actions

Also available in: Atom PDF