action #125276
coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance
coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry
Ensure that the incomplete jobs with "cache service full" are properly restarted size:M
0%
Description
Motivation¶
We have been getting reports of jobs with "cache service full" having to be manually restarted, which is causing quite a bit of work for some people.
Acceptance criteria¶
- AC1: Users don't see incomplete jobs with "cache service full" on their build status page, e.g. jobs are automatically restarted
Acceptance tests¶
- AC1: Redo queries from #125276#note-3 to see that all "cache service full" jobs have been cloned and none had to be cloned by users
Suggestion¶
- Make error of automatic invocation of
auto_duplicate
somewhere visible, e.g. by doing the restart within a Minion job (which then might fail and has the error message) - Fix the actual issue of restarting once it could be pinned down
- Make sure the new Minion jobs don't fail later in production once this problem has been resolved
History
#3
Updated by mkittler 3 months ago
The auto_clone_regex
from our default config should cover this. The default is also used on OSD and o3.
All relevant jobs have been cloned on o3:
openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc; id | clone_id ----+---------- (0 rows)
Of course these jobs might have been restarted by users. There were in fact 5 recent jobs which incompleted with cache failure: Cache service queue already full …
and were restarted manually:
openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; id | clone_id ---------+---------- 3150461 | 3151096 3150447 | 3151068 3150398 | 3151069 3150381 | 3151088 3150362 | 3151055 (5 rows)
So the auto_clone_regex
feature doesn't work very well. However, it generally does work because the same query for jobs that have been automatically restarted (no audit event recorded) reveals many more jobs:
openqa=# select id, clone_id from jobs where (id > 3150000) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is null order by id desc; id | clone_id ---------+---------- 3152598 | 3152600 3152028 | 3152034 3151805 | 3152025 3151800 | 3152032 3151741 | 3152007 3151740 | 3152016 3151733 | 3152035 3151727 | 3152012 3151726 | 3152005 3151716 | 3152033 3151715 | 3152015 3151708 | 3152028 3151697 | 3152021 3151694 | 3152023 3151690 | 3152006 3151689 | 3152026 3151262 | 3151268 3151261 | 3151267 3151254 | 3151269 3150801 | 3150806 3150774 | 3150808 3150768 | 3150812 3150754 | 3150807 3150749 | 3150816 3150744 | 3150746 3150738 | 3150851 3150723 | 3150737 3150710 | 3150768 3150696 | 3150799 3150681 | 3150749 3150676 | 3150710 3150674 | 3150853 3150486 | 3150871 3150478 | 3150872 3150476 | 3150703 3150474 | 3150723 3150467 | 3150676 3150452 | 3150875 3150445 | 3150689 3150444 | 3150730 3150442 | 3150680 3150440 | 3150748 3150414 | 3150747 3150412 | 3150675 3150411 | 3150673 3150408 | 3150774 3150400 | 3150754 3150399 | 3150696 3150395 | 3150785 3150394 | 3150745 3150392 | 3150678 3150389 | 3150732 3150387 | 3150837 3150385 | 3150742 3150383 | 3150744 3150378 | 3150679 3150377 | 3150801 3150375 | 3150771 3150374 | 3150738 3150373 | 3150751 3150372 | 3150681 3150367 | 3150674 3150366 | 3150672 3150365 | 3150803 3150363 | 3150725 3150361 | 3150684 3150359 | 3150740 3150358 | 3150779 3150356 | 3150741 3150354 | 3150677 3150351 | 3150735 3150348 | 3150743 3150051 | 3150081 3150049 | 3150083 3150048 | 3150082 3150041 | 3150084 3150035 | 3150080 3150034 | 3150079 3150033 | 3150078 3150032 | 3150077 3150031 | 3150076 3150007 | 3150074 (82 rows)
Now the question is: Why does the auto_clone_regex
feature sometimes work and sometimes doesn't? The incomplete reason of those jobs is really identical.
#4
Updated by mkittler 3 months ago
The logs for one of the manually restarted jobs looks like this:
martchus@ariel:~> xzgrep 3150461 /var/log/openqa* /var/log/openqa: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit", /var/log/openqa: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs", /var/log/openqa: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt", /var/log/openqa.3.xz:[2023-03-01T08:48:19.678603Z] [debug] [pid:1818] Started to send message to 517 for job(s) 3150461 /var/log/openqa.3.xz:[2023-03-01T08:48:19.775734Z] [debug] [pid:1818] Worker 517 accepted job 3150461 /var/log/openqa.3.xz: 3150461 => { /var/log/openqa.3.xz:[2023-03-01T08:48:40.460536Z] [debug] [pid:15421] _carry_over_candidate(3150461): _failure_reason=GOOD /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit", /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs", /var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt", /var/log/openqa.3.xz:[2023-03-01T08:49:06.985123Z] [debug] Restarting job 3150461 /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit", /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/ulogs", /var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150461-opensuse-Tumbleweed-DVD-x86_64-Build20230228-security_tpm2_keylime\@64bit/autoinst-log.txt", /var/log/openqa.3.xz:[2023-03-01T11:01:11.703255Z] [debug] Restarting job 3150461 /var/log/openqa.3.xz: 3150461 => { /var/log/openqa.3.xz:[2023-03-01T11:01:12.021846Z] [debug] [pid:6405] Job 3150461 duplicated as 3151096
Not sure whether we'd see errors for a failed automatic auto_duplicate
invocation. We definitely see that the job has been restarted via the API (Restarting job
is only logged by the API controller, not when restarting automatically).
For another job the restarting via API is logged twice:
martchus@ariel:~> xzgrep 3150447 /var/log/openqa* /var/log/openqa.3.xz:[2023-03-01T08:48:03.025112Z] [debug] [pid:1818] Started to send message to 518 for job(s) 3150447 /var/log/openqa.3.xz:[2023-03-01T08:48:03.112224Z] [debug] [pid:1818] Worker 518 accepted job 3150447 /var/log/openqa.3.xz: 3150447 => { /var/log/openqa.3.xz:[2023-03-01T08:48:34.946664Z] [debug] [pid:17388] _carry_over_candidate(3150447): _failure_reason=GOOD /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit", /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs", /var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt", /var/log/openqa.3.xz:[2023-03-01T08:48:52.482642Z] [debug] Restarting job 3150447 /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit", /var/log/openqa.3.xz: "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/ulogs", /var/log/openqa.3.xz: path => "/var/lib/openqa/testresults/03150/03150447-opensuse-Tumbleweed-DVD-x86_64-Build20230228-extra_tests_dracut\@64bit/autoinst-log.txt", /var/log/openqa.3.xz:[2023-03-01T10:59:23.167754Z] [debug] Restarting job 3150447 /var/log/openqa.3.xz: 3150447 => { /var/log/openqa.3.xz:[2023-03-01T10:59:23.494619Z] [debug] [pid:2031] Job 3150447 duplicated as 3151068
But there only one Job … duplicated
line logged. Maybe the first manual restart attempt failed so the user tried again. This lets me think that we maybe have a general problem with restarts (independently of whether they're triggered automatically or manually) and that those errors are suppressed.
The code that invoked auto_duplicate
in case the auto clone regex matches in fact completely ignores any error returned from that function.
When auto_duplicate
is invoked the error returned from that function is returned to the user and not logged. That would explain why the log above doesn't contain any errors for the first Restarting job 3150447
. Of course this leaves us with no clue what the error actually was.
#5
Updated by openqa_review 3 months ago
- Due date set to 2023-03-17
Setting due date based on mean cycle time of SUSE QE Tools
#7
Updated by mkittler 3 months ago
- Description updated (diff)
PR for improving the error handling: https://github.com/os-autoinst/openQA/pull/5025
I'll do the queries again a while after it has been deployed and also check the Minion dashboard.
#10
Updated by mkittler 3 months ago
I couldn't find any failing Minion restart_job
jobs except for cases when the job has already been restarted anyways. Those shouldn't actually be failures. I've created https://github.com/os-autoinst/openQA/pull/5031 to fix this.
The list of jobs with reason "cache service full" that had to be restarted manually nevertheless grew longer (this is the list of jobs as of the last job considered last time in #125276125300#note-3):
openqa=# select id, clone_id from jobs where reason like '%ache service queue already full%' and clone_id is null order by id desc; id | clone_id ----+---------- (0 rows) openqa=# select id, clone_id, t_finished from jobs where (id > 3150461) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; id | clone_id | t_finished ---------+----------+--------------------- 3162168 | 3162458 | 2023-03-08 03:32:46 3161970 | 3162402 | 2023-03-08 03:32:44 3161969 | 3162457 | 2023-03-08 03:32:47 3161961 | 3162411 | 2023-03-08 03:32:33 3161953 | 3162428 | 2023-03-08 03:32:43 3161912 | 3162412 | 2023-03-08 03:32:27 3157176 | 3157708 | 2023-03-05 12:17:27 3155695 | 3156172 | 2023-03-04 07:20:41 (8 rows)
The earliest restart_job
job was created on 2023-03-08T17:21:40.522945Z so https://github.com/os-autoinst/openQA/pull/5025 was deployed around that time. (The are much older finished Minion jobs so the automatic cleanup of finished jobs can be ignored here.) That means all further jobs revealed by the query had finished before the PR was deployed and are thus not relevant.
So maybe using a Minion job with retry made it more reliably, indeed. I'll re-check next week.
Note that most of these jobs were finishing around the same time, likely when openqa-webui.service
restarted (the journal doesn't go far enough to say for sure). A restart of openqa-webui.service
would actually explain why the restarting didn't happen. Supposedly that's something the mentioned PR would indeed improve.
#11
Updated by mkittler 3 months ago
No further occurrences:
openqa=# select id, clone_id from jobs where (id > 3162168) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; id | clone_id ----+---------- (0 rows)
The PR https://github.com/os-autoinst/openQA/pull/5031 is still pending so I've been removing failed restart jobs manually for now.
#12
Updated by mkittler 3 months ago
- Status changed from Feedback to Resolved
There are no failing restart jobs and no further occurrences:
openqa=# select id, clone_id from jobs where (t_created >= '2023-03-13 10:00:00') and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc; id | clone_id ---------+---------- 3172696 | 3172803 (1 row)
The query shows one result but that's only because this job itself is the result of another job being manually restarted.