action #125276
coordination #103944 (closed): [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance
coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Ensure that the incomplete jobs with "cache service full" are properly restarted size:M
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Motivation
We have been getting reports of jobs with "cache service full" having to be manually restarted, which is causing quite a bit of work for some people.
Acceptance criteria
- AC1: Users don't see incomplete jobs with "cache service full" on their build status page, e.g. jobs are automatically restarted
Acceptance tests
- AC1: Redo queries from #125276#note-3 to see that all "cache service full" jobs have been cloned and none had to be cloned by users
Suggestion
- Make errors from the automatic invocation of auto_duplicate visible somewhere, e.g. by doing the restart within a Minion job (which might then fail and carry the error message); see the sketch after this list
- Fix the actual issue with restarting once it can be pinned down
- Make sure the new Minion jobs don't fail later in production once this problem has been resolved
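A minimal sketch of what such a Minion-based restart could look like (the task name restart_job, the resultset lookup and the error handling around auto_duplicate are illustrative assumptions, not the actual openQA implementation):

# Sketch only: names and error handling are assumptions, not the actual openQA code.
# Registering the restart as a Minion task makes a failed restart visible on the Minion dashboard:
$app->minion->add_task(restart_job => sub {
    my ($minion_job, $openqa_job_id) = @_;
    my $openqa_job = $app->schema->resultset('Jobs')->find($openqa_job_id)
      or return $minion_job->fail("Job $openqa_job_id does not exist");
    my $result = $openqa_job->auto_duplicate;
    # assumption: auto_duplicate returns an error string (not a reference) on failure
    return $minion_job->fail("Unable to restart job $openqa_job_id: $result") unless ref $result;
    return $minion_job->finish("Restarted job $openqa_job_id as " . $result->id);
});

# Instead of calling auto_duplicate directly when auto_clone_regex matches, enqueue the task with retries:
$app->minion->enqueue(restart_job => [$openqa_job->id], {attempts => 3});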
Updated by mkittler over 1 year ago
The auto_clone_regex from our default config should cover this. The default is also used on OSD and o3.
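For reference, an illustrative snippet of how such a setting is configured in /etc/openqa/openqa.ini (the section name and the value shown here are from memory and not copied from the shipped config, which has a broader default regex):

[global]
# illustrative example only, the shipped default regex is broader
auto_clone_regex = cache failure: Cache service queue already full|terminated prematurely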
All relevant jobs have been cloned on o3:
openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc;
id | clone_id
----+----------
(0 rows)
Of course these jobs might have been restarted by users. There were in fact 5 recent jobs which incompleted with reason cache failure: Cache service queue already full … and were restarted manually:
openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
id | clone_id
---------+----------
3150461 | 3151096
3150447 | 3151068
3150398 | 3151069
3150381 | 3151088
3150362 | 3151055
(5 rows)
So the auto_clone_regex feature doesn't work reliably. However, it generally does work, because the same query for jobs that were restarted automatically (no audit event recorded) reveals many more jobs:
openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is null order by id desc;
id | clone_id
---------+----------
3152598 | 3152600
3152028 | 3152034
3151805 | 3152025
3151800 | 3152032
3151741 | 3152007
3151740 | 3152016
3151733 | 3152035
3151727 | 3152012
3151726 | 3152005
3151716 | 3152033
3151715 | 3152015
3151708 | 3152028
3151697 | 3152021
3151694 | 3152023
3151690 | 3152006
3151689 | 3152026
3151262 | 3151268
3151261 | 3151267
3151254 | 3151269
3150801 | 3150806
3150774 | 3150808
3150768 | 3150812
3150754 | 3150807
3150749 | 3150816
3150744 | 3150746
3150738 | 3150851
3150723 | 3150737
3150710 | 3150768
3150696 | 3150799
3150681 | 3150749
3150676 | 3150710
3150674 | 3150853
3150486 | 3150871
3150478 | 3150872
3150476 | 3150703
3150474 | 3150723
3150467 | 3150676
3150452 | 3150875
3150445 | 3150689
3150444 | 3150730
3150442 | 3150680
3150440 | 3150748
3150414 | 3150747
3150412 | 3150675
3150411 | 3150673
3150408 | 3150774
3150400 | 3150754
3150399 | 3150696
3150395 | 3150785
3150394 | 3150745
3150392 | 3150678
3150389 | 3150732
3150387 | 3150837
3150385 | 3150742
3150383 | 3150744
3150378 | 3150679
3150377 | 3150801
3150375 | 3150771
3150374 | 3150738
3150373 | 3150751
3150372 | 3150681
3150367 | 3150674
3150366 | 3150672
3150365 | 3150803
3150363 | 3150725
3150361 | 3150684
3150359 | 3150740
3150358 | 3150779
3150356 | 3150741
3150354 | 3150677
3150351 | 3150735
3150348 | 3150743
3150051 | 3150081
3150049 | 3150083
3150048 | 3150082
3150041 | 3150084
3150035 | 3150080
3150034 | 3150079
3150033 | 3150078
3150032 | 3150077
3150031 | 3150076
3150007 | 3150074
(82 rows)
Now the question is: Why does the auto_clone_regex feature sometimes work and sometimes not? The incomplete reason of those jobs is really identical.
Updated by mkittler over 1 year ago
The logs for one of the manually restarted jobs look like this:
martchus@ariel:~> xzgrep 3150461 /var/log/openqa*
/var/log/openqa: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:48:19.678603Z] [debug] [pid:1818] Started to send message to 517 for job(s) 3150461
/var/log/openqa.3.xz:[2023-03-01T08:48:19.775734Z] [debug] [pid:1818] Worker 517 accepted job 3150461
/var/log/openqa.3.xz: 3150461 => {
/var/log/openqa.3.xz:[2023-03-01T08:48:40.460536Z] [debug] [pid:15421] _carry_over_candidate(3150461): _failure_reason=GOOD
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:49:06.985123Z] [debug] Restarting job 3150461
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit",
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs",
/var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T11:01:11.703255Z] [debug] Restarting job 3150461
/var/log/openqa.3.xz: 3150461 => {
/var/log/openqa.3.xz:[2023-03-01T11:01:12.021846Z] [debug] [pid:6405] Job 3150461 duplicated as 3151096
Not sure whether we'd see errors for a failed automatic auto_duplicate invocation. We definitely see that the job has been restarted via the API ("Restarting job" is only logged by the API controller, not when restarting automatically).
For another job, the restart via the API is logged twice:
martchus@ariel:~> xzgrep 3150447 /var/log/openqa*
/var/log/openqa.3.xz:[2023-03-01T08:48:03.025112Z] [debug] [pid:1818] Started to send message to 518 for job(s) 3150447
/var/log/openqa.3.xz:[2023-03-01T08:48:03.112224Z] [debug] [pid:1818] Worker 518 accepted job 3150447
/var/log/openqa.3.xz: 3150447 => {
/var/log/openqa.3.xz:[2023-03-01T08:48:34.946664Z] [debug] [pid:17388] _carry_over_candidate(3150447): _failure_reason=GOOD
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit",
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs",
/var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T08:48:52.482642Z] [debug] Restarting job 3150447
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit",
/var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs",
/var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt",
/var/log/openqa.3.xz:[2023-03-01T10:59:23.167754Z] [debug] Restarting job 3150447
/var/log/openqa.3.xz: 3150447 => {
/var/log/openqa.3.xz:[2023-03-01T10:59:23.494619Z] [debug] [pid:2031] Job 3150447 duplicated as 3151068
But there is only one Job … duplicated line logged. Maybe the first manual restart attempt failed, so the user tried again. This makes me think that we may have a general problem with restarts (regardless of whether they're triggered automatically or manually) and that those errors are suppressed.
The code that invokes auto_duplicate when the auto clone regex matches in fact completely ignores any error returned from that function. When auto_duplicate is invoked via the API, the error returned from that function is returned to the user and not logged. That would explain why the log above doesn't contain any errors for the first Restarting job 3150447. Of course this leaves us with no clue what the error actually was.
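For illustration, the difference boils down to something like this (a hypothetical sketch, not the actual openQA code; it assumes auto_duplicate returns an error string on failure and log_warning stands in for whatever logging helper is available):

# Hypothetical sketch, not the actual openQA code.
# Automatic restart path: the return value of auto_duplicate is dropped,
# so a failed restart leaves no trace anywhere.
$job->auto_duplicate if $job->reason =~ $auto_clone_regex;

# A variant that at least makes the failure visible in the log:
if ($job->reason =~ $auto_clone_regex) {
    my $result = $job->auto_duplicate;
    log_warning('Unable to restart job ' . $job->id . ": $result") unless ref $result;
}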
Updated by openqa_review over 1 year ago
- Due date set to 2023-03-17
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
- Subject changed from Ensure that the incomplete jobs with "cache service full" are properly restarted to Ensure that the incomplete jobs with "cache service full" are properly restarted size:M
- Description updated (diff)
Updated by mkittler over 1 year ago
- Description updated (diff)
PR for improving the error handling: https://github.com/os-autoinst/openQA/pull/5025
I'll do the queries again a while after it has been deployed and also check the Minion dashboard.
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
The PR has been merged. Maybe the retry already helps or we at least have a concrete error message now. I'll do the queries again tomorrow evening or on Friday.
Updated by mkittler over 1 year ago
I couldn't find any failing Minion restart_job jobs except for cases where the job had already been restarted anyway. Those shouldn't actually count as failures. I've created https://github.com/os-autoinst/openQA/pull/5031 to fix this.
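The gist of that fix is along these lines (a sketch under the assumption that the Minion task has already looked up the openQA job; this is not the literal PR diff):

# Sketch: treat "already restarted" as a normal finish instead of a failure
return $minion_job->finish(
    "Job $openqa_job_id has already been cloned as " . $openqa_job->clone_id)
  if defined $openqa_job->clone_id;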
The list of jobs with reason "cache service full" that had to be restarted manually nevertheless grew longer (this is the list of jobs after the last one considered in #125276#note-3):
openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc;
id | clone_id
----+----------
(0 rows)
openqa=# select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
id | clone_id | t_finished
---------+----------+---------------------
3162168 | 3162458 | 2023-03-08 03:32:46
3161970 | 3162402 | 2023-03-08 03:32:44
3161969 | 3162457 | 2023-03-08 03:32:47
3161961 | 3162411 | 2023-03-08 03:32:33
3161953 | 3162428 | 2023-03-08 03:32:43
3161912 | 3162412 | 2023-03-08 03:32:27
3157176 | 3157708 | 2023-03-05 12:17:27
3155695 | 3156172 | 2023-03-04 07:20:41
(8 rows)
The earliest restart_job Minion job was created on 2023-03-08T17:21:40.522945Z, so https://github.com/os-autoinst/openQA/pull/5025 was deployed around that time. (There are much older finished Minion jobs, so the automatic cleanup of finished jobs can be ignored here.) That means all further jobs revealed by the query had finished before the PR was deployed and are thus not relevant.
So maybe using a Minion job with retry made it more reliable, indeed. I'll re-check next week.
Note that most of these jobs finished around the same time, likely when openqa-webui.service was restarted (the journal doesn't go back far enough to say for sure). A restart of openqa-webui.service would actually explain why the restarting didn't happen. Presumably that's something the mentioned PR would indeed improve.
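That would also explain why moving the restart into a Minion job helps: Minion jobs are persisted in the database, so an enqueued restart_job task survives a service restart and a failed attempt can be retried. A sketch of such a retry (the delay and attempt limit are made up):

# Sketch: retry a failed attempt after a delay instead of giving up immediately,
# so a transient problem (like the web UI being restarted) does not lose the restart
return $minion_job->retry({delay => 60}) if $minion_job->info->{retries} < 3;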
Updated by mkittler over 1 year ago
No further occurrences:
openqa=# select id, clone_id from jobs where (id > 3162168) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
id | clone_id
----+----------
(0 rows)
The PR https://github.com/os-autoinst/openQA/pull/5031 is still pending so I've been removing failed restart jobs manually for now.
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
There are no failing restart jobs and no further occurrences:
openqa=# select id, clone_id from jobs where (t_created >= '2023-03-13 10:00:00') and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
id | clone_id
---------+----------
3172696 | 3172803
(1 row)
The query shows one result but that's only because this job itself is the result of another job being manually restarted.