action #128267

closed

coordination #103944: [saga][epic] Scale up: More robust handling of diverse infrastructure with varying performance

coordination #98463: [epic] Avoid too slow asset downloads leading to jobs exceeding the timeout with or run into auto_review:"(timeout: setup exceeded MAX_SETUP_TIME|Cache service queue already full)":retry

Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M

Added by okurz over 1 year ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

Still lots of "cache queue full" errors, reported in https://suse.slack.com/archives/C02CANHLANP/p1682406454494569 by dimstar:

(Dominique Leuenberger) Seems this kind of error is back (or more active again than it used to be in the last few weeks): https://openqa.opensuse.org/tests/3243495
Reason: asset failure: Failed to download opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-Tumbleweed-x86_64-20230424-textmode@64bit.qcow2; I thought it was addressed? (at least it felt like it, as it did not appear for a while now. Might just have been lucky though)
(Dominique Leuenberger) The start of the fail chain seems to be in https://openqa.opensuse.org/tests/3243518
Reason: cache failure: Cache service queue already full (5)
Cloned as 3243726
(the auto-clone not taking the children into account is known and unfixed)
(Fabian Vogt) This "Cache service queue already full" error is highly annoying
Every time a worker starts with a clear cache the first dozen tests fail with that
Maybe the queue just needs to be grown 10x or something...
(Dominique Leuenberger) ah, then the luck was probably that the snapshot moved to QA in the late evening, not early morning; so I happened to not be the first consumer

Acceptance criteria

  • AC1: Restarting one of two independent root jobs (only related indirectly via parallel dependency) is handled well (no job ends up as parallel_failed when it has no direct parallel dependencies, no chained children are executed without their parent being successful)
  • AC2: Restarting jobs (e.g. due to full cache queue) is generally handled well. So use cases similar to AC1 are also covered.

Suggestions

  • Understand why #125276 could not fix the problem
  • Make sure jobs really restart if the cache service queue is full
  • Double- and triple-check jobs visible on https://openqa.opensuse.org
  • Get in touch with dimstar+fvogt to ensure the problem is fully addressed

Files


Related issues 2 (0 open, 2 closed)

Copied from openQA Project - action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M - Rejected - mkittler - 2021-08-09

Copied to openQA Project - action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M - Resolved - mkittler - 2023-04-25

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #96684: Abort asset download via the cache service when related job runs into a timeout (or is otherwise cancelled) size:M added
Actions #2

Updated by favogt over 1 year ago

IMO the error should not happen in the first place. FWICT, when this happens it never even attempts to download anything; it immediately fails the job.

Actions #3

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #4

Updated by mkittler over 1 year ago

  • Status changed from New to In Progress

Understand why #125276 could not fix the problem

The "service queue already full" problem was only "fixed" by #125276 in the sense that none of such jobs need to be restarted manually anymore. It does not look like this was necessary today:

openqa=# select id, clone_id from jobs where (id > 3244452) and (reason like '%service queue already full%' ) and (select user_id from audit_events where event_data like concat('%', jobs.id, '%') and event = 'job_restart' limit 1) is not null order by id desc;
 id | clone_id 
----+----------
(0 rows)

So my changes for #125276 work in the sense that the "service queue already full" problem shouldn't be a problem for our users anymore. There can still be jobs ending up as incompletes with that reason, though.

Make sure jobs really restart if the cache service queue is full

I don't think there's evidence showing the restart is broken in this case.
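
As a spot check, a query along these lines should work (it only relies on the reason and clone_id columns used above plus t_finished, and assumes that a restarted incomplete always gets a clone_id); if it keeps returning zero rows, the automatic restart appears to be doing its job:

openqa=# select id, t_finished, reason from jobs
           where reason like '%service queue already full%'
             and clone_id is null
             and t_finished > now() - interval '1 day'
           order by id desc;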

Actions #5

Updated by mkittler over 1 year ago

The real problem here is that the asset couldn't be downloaded at the time it was attempted. The asset is now there and the download would also work using the attempted URL from openqaworker19.

The job that was supposed to publish that asset ended up parallel_failed although it has no parallel dependencies (https://openqa.opensuse.org/tests/3243402). This is very strange. Unfortunately the worker logs don't give any clue about this. The job has the reason abort, which means the job was stopped via the worker command abort (the constant WORKER_COMMAND_ABORT in the code). At first glance this command seemed to be used nowhere, which was also very strange, but it is actually used when a job is aborted (sub abort).

The scheduler log contains messages referring to that job:

[2023-04-25T07:17:18.663492Z] [warn] [pid:31926] Failed to send data to websocket server, reason: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "job_networks_pkey"
DETAIL:  Key (name, job_id)=(fixed, 3243402) already exists. [for Statement "INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)" with ParamValues: 1='3243402', 2='fixed', 3='120'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1582

[2023-04-25T07:17:18.663667Z] [warn] [pid:31926] Failed sending job(s) '3243861' to worker '191': unknown error
[2023-04-25T07:17:18.677411Z] [debug] [pid:31926] Job 3243861 reset to state scheduled
[2023-04-25T07:17:18.682670Z] [debug] [pid:31926] Scheduler took 0.91926s to perform operations and allocated 0 jobs

Before that, the job had been assigned once:

…
[2023-04-25T04:51:34.392885Z] [debug] [pid:31926] Need to schedule 1 parallel jobs for job 3243402 (with priority 45)
…
[2023-04-25T04:51:35.585757Z] [debug] [pid:31926] Assigned job '3243402' to worker ID '202'
[2023-04-25T04:51:35.588647Z] [debug] [pid:31926] [Job#3243402] Prepare for being processed by worker 202
[2023-04-25T04:51:35.714400Z] [debug] [pid:31926] Sent job(s) '3243402' to worker '202'
…
[2023-04-25T04:51:42.479075Z] [debug] [pid:31926] Allocated: { job => 3243402, worker => 202 }
…
[2023-04-25T05:18:48.348529Z] [warn] [pid:31926] Failed to send data to websocket server, reason: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "job_networks_pkey"
DETAIL:  Key (name, job_id)=(fixed, 3243402) already exists. [for Statement "INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)" with ParamValues: 1='3243402', 2='fixed', 3='120'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1582

[2023-04-25T05:18:48.348745Z] [warn] [pid:31926] Failed sending job(s) '3243559' to worker '544': unknown error
…
[2023-04-25T05:18:48.625554Z] [warn] [pid:31926] Failed to send data to websocket server, reason: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "job_networks_pkey"
DETAIL:  Key (name, job_id)=(fixed, 3243402) already exists. [for Statement "INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)" with ParamValues: 1='3243402', 2='fixed', 3='120'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1582

[2023-04-25T05:18:48.625692Z] [warn] [pid:31926] Failed sending job(s) '3243706' to worker '204': unknown error

The other service logs also contain references to that job but none are interesting.

So it looks like an exception due to a race condition when creating the job network can lead to messed up jobs.

Actions #6

Updated by okurz over 1 year ago

  • Copied to action #128276: Handle workers with busy cache service gracefully by a two-level wait size:M added
Actions #7

Updated by okurz over 1 year ago

favogt wrote:

IMO the error should not happen in the first place. FWICT, when this happens it never even attempts to download anything; it immediately fails the job.

yes, that should be addressed in #128276

Actions #8

Updated by mkittler over 1 year ago

  • Subject changed from Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2) to Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2))
  • Description updated (diff)

Ok - I've adjusted the title and AC1 of this ticket.

Actions #9

Updated by mkittler over 1 year ago

Ok, so that's what happened:

  1. Job 1 (the current one in the attached screenshot, https://openqa.opensuse.org/tests/3243518) was restarted automatically because the cache queue was full. The reason for the restart doesn't matter, though; if a user had clicked the restart button the behavior would have been the same.
    • So this first step is normal behavior and must be handled well.
  2. The chained child of job 1 (job 2, https://openqa.opensuse.org/tests/3243519) is skipped.
    • That is definitely what should happen.
  3. The parallel child of job 2 (job 3, https://openqa.opensuse.org/tests/3243701) is skipped as well.
    • That is still what should happen.
  4. The chained parent of job 3 (job 4, the "other root" in the attached screenshot, https://openqa.opensuse.org/tests/3243402) is restarted. It ends up with the result parallel_restarted because a job being restarted as a dependency normally only happens for parallel dependencies. Here, however, it happened because a chained child had been skipped.
    • That supposedly should not have happened. I'll have to ponder it more, though (e.g. what the further steps would have looked like if it had not happened, or in different cases such as job 4 needing to be restarted itself).
    • The code does not account for this situation. It assumes that a dependent job that is running must be a parallel dependency: any chained parent is assumed to have already finished and any chained child is assumed not to have been started yet. However, as we have seen, these assumptions are not correct.
    • The result parallel_restarted is not completely wrong, though. In fact, a related parallel job has been restarted. It is just that this dependency is only indirect.
    • If a job is restarted as a dependency like job 4, it is aborted. This means no further logs/assets are uploaded, so the fact that job 4 has no logs is actually not a bug.
  5. The chained children of job 4 are not skipped.
    • This is because job 4 is concluded in a state where its result is already known (as parallel_restarted is assigned in 4.). The behavior in this state is an explicit choice and could easily be changed. I'll have to ponder what other problems would arise if we changed it, though.
    • This also means the job networks of those children are never cleaned up.
    • If we changed the behavior by removing defined $new_val{result} from the condition in sub done so that _job_stop_cluster also runs in this state, then the job would, without further tweaking, end up as parallel_failed. That would supposedly be even more misleading than parallel_restarted.
  6. Related jobs of job 4 are still considered by the scheduler due to 5.
    • This supposedly leads to the errors mentioned in #128267#note-5 because networks haven't been correctly released.
  7. The chained children of job 4 are started at some point. So they run although the asset that job 4 should have published is missing, which explains the 404 error of the cache service download.
    • So the cache service is not at fault here; the asset was really missing at this point.

That means the behavior should likely be changed so that:

  1. Only parallel jobs (that are running) are restarted as dependencies.
  2. If a job ends up as parallel_restarted, its children should be skipped right away as well and not just when the worker reports the job as done. Otherwise, if the worker did not respond anymore for some reason, we'd still accidentally execute those children. (A rough sketch of this point follows the list.)
  3. Job networks should maybe be released when a job is aborted - or is that too early? Maybe the scheduler code should also be able to deal better with the situation of already existing job networks.
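
As a rough illustration of point 2, here is a minimal standalone sketch (plain Perl over hashes, not the actual openQA code; the state/result names merely mirror what openQA uses) of skipping pending chained children as soon as a job is concluded as parallel_restarted:

use strict;
use warnings;

sub conclude_as_parallel_restarted {
    my ($job, $children) = @_;
    $job->{result} = 'parallel_restarted';
    # skip pending chained children immediately instead of waiting for the
    # worker to report the job as done (which may never happen)
    for my $child (grep { $_->{dependency} eq 'chained' } @$children) {
        next unless $child->{state} eq 'scheduled';
        $child->{state}  = 'cancelled';
        $child->{result} = 'skipped';
    }
}

my $job      = {id => 4, state => 'running'};
my @children = ({id => 5, dependency => 'chained', state => 'scheduled'});
conclude_as_parallel_restarted($job, \@children);
print "child $children[0]{id}: $children[0]{state}/$children[0]{result}\n";    # prints "child 5: cancelled/skipped"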
Actions #10

Updated by openqa_review over 1 year ago

  • Due date set to 2023-05-10

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan over 1 year ago

  • Subject changed from Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) to Restarting jobs (e.g. due to full cache queue) can lead to weird behavior for certain job dependencies (was: Ensure that the incomplete jobs with "cache service full" are properly restarted (take 2)) size:M
  • Description updated (diff)
Actions #12

Updated by mkittler over 1 year ago

Only parallel jobs (that are running) are restarted as dependencies.

After more pondering I came to the conclusion that this doesn't actually make complete sense. If the chained parent had failed, it would make sense to restart it as well. What we actually need is to avoid the restart if the parent is still pending, which is coincidentally what the PR https://github.com/os-autoinst/openQA/pull/4962 provides.
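
A minimal standalone sketch of that rule (plain Perl, not the code from the PR; state/result names only mirror openQA's): a chained parent is restarted along with the job only if it already finished unsuccessfully, while a parent that is still pending is left alone:

use strict;
use warnings;

sub restart_chained_parent_too {
    my ($parent) = @_;
    return 0 if $parent->{state} ne 'done';    # parent still pending: leave it running
    # only a parent that finished unsuccessfully needs to be restarted as well
    return ($parent->{result} // '') =~ /failed|incomplete/ ? 1 : 0;
}

print restart_chained_parent_too({state => 'running'}), "\n";                     # 0
print restart_chained_parent_too({state => 'done', result => 'failed'}), "\n";    # 1
print restart_chained_parent_too({state => 'done', result => 'passed'}), "\n";    # 0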

I have additionally created https://github.com/os-autoinst/openQA/pull/5106 to reproduce the situation in unit tests and verify that https://github.com/os-autoinst/openQA/pull/4962 fixes it. It also contains another small change to avoid a possible race condition (that we haven't seen in production so far, though). I have also added a test case to see what happens if the 2nd root needs to be restarted after all. Luckily this just works without further modifications.


I will still check whether implementing the other two points mentioned in my previous comment makes sense. They are about follow-up problems we shouldn't see anymore after fixing the first place where things went wrong. However, for the sake of general reliability they could still be improved.

Actions #13

Updated by mkittler over 1 year ago

Point 2 is already covered by https://github.com/os-autoinst/openQA/pull/5106/files#diff-1c4e8d9bc0096026bc67314675374924c7a5da02359b426a1c6401eac9e90abe for the case of automatic restarting. I've created another PR for the case of manual restarting: https://github.com/os-autoinst/openQA/pull/5107


About point 3: we're actually still seeing the error about the network in the scheduler log:

[2023-04-25T07:17:18.663492Z] [warn] [pid:31926] Failed to send data to websocket server, reason: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "job_networks_pkey"
DETAIL:  Key (name, job_id)=(fixed, 3243402) already exists. [for Statement "INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)" with ParamValues: 1='3243402', 2='fixed', 3='120'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1582

[2023-04-25T07:17:18.663667Z] [warn] [pid:31926] Failed sending job(s) '3243861' to worker '191': unknown error

This is actually also happening on OSD:

[2023-04-28T08:34:43.469209+02:00] [warn] [pid:10678] Failed to send data to websocket server, reason: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  duplicate key value violates unique constraint "job_networks_pkey"
DETAIL:  Key (name, job_id)=(fixed, 10959941) already exists. [for Statement "INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)" with ParamValues: 1='10959941', 2='fixed', 3='99'] at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 1581

Maybe I'll have to remove problematic rows from the databases manually. However, it would also be nice to come up with a code change to prevent this in the future.

Querying the job networks shows nothing special; the row simply already exists (and our code assumes it does not):

openqa=# select * from job_networks where job_id = 3243402 ;
 name  | job_id  | vlan 
-------+---------+------
 fixed | 3243402 |  117
(1 row)

So at some point the cleanup of networks must go wrong.
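
To find (and if necessary manually remove) such leftover rows, a query along these lines should do (assuming that rows belonging to jobs that are already done can no longer be legitimately in use):

openqa=# select jn.job_id, jn.name, jn.vlan, j.state, j.result
           from job_networks jn join jobs j on j.id = jn.job_id
          where j.state = 'done';
openqa=# -- and, only if the result looks plausible:
openqa=# delete from job_networks using jobs
          where jobs.id = job_networks.job_id and jobs.state = 'done';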

Actions #14

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback

I haven't seen these log messages in recent o3 and OSD logs anymore, so the situation is eventually self-resolving. However, the code for allocating job networks is very wrong:

Using a transaction does not really make sense. It was added by https://github.com/os-autoinst/openQA/commit/3c52abbe3d6364c02f73c6f9fe4afe44f89688f6 because find_or_create does not handle concurrent insertions. However, I don't see that we actually do any concurrent insertions. Let's assume I'm wrong and we do: then it still doesn't make sense because the error is not handled correctly. If the transaction failed, the code would simply attempt the find-and-insert again. That would work, but lead to the situation where $created is falsy; in this case the for-loop would continue forever because the return in the nested if-block is never reached. In case a fatal error occurs it would also loop forever because the next; doesn't work inside the catch block¹. So supposedly we never really run into these problematic code paths. I have simplified the code to avoid the transaction but kept an explicit error message in case the network already exists (so in case this really does happen we can easily distinguish it and possibly tweak the behavior as needed).
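
A rough sketch of the simplified approach (using plain DBI rather than the DBIx::Class code actually touched in the PR, and with hypothetical function names): just attempt the insert and, if the row already exists, report that explicitly instead of retrying inside a transaction:

use strict;
use warnings;
use DBI;

sub allocate_network {
    my ($dbh, $job_id, $name, $vlan) = @_;
    my $inserted = eval {
        $dbh->do('INSERT INTO job_networks (job_id, name, vlan) VALUES (?, ?, ?)',
            undef, $job_id, $name, $vlan);
    };
    return 1 if $inserted;
    # the row can be left over from a cluster that was never cleaned up properly;
    # report this explicitly instead of looping forever like the old code path could
    warn "network '$name' for job $job_id already exists or could not be created: "
        . ($@ || $dbh->errstr || 'unknown error') . "\n";
    return 0;
}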

When creating networks for the whole cluster a network may already exist. That's what the error messages from my previous comment are about. I guess it is ok to simply not treat this as an error. Likely the job is in a state where it doesn't need a different network allocated anymore anyway.

I've also added a test case to see what would generally happen if job networks are allocated again when they haven't been properly released before. This is generally handled well. So I guess we don't have to make releasing job networks more reliable.

PR with improvements for job networks: https://github.com/os-autoinst/openQA/pull/5111


¹

main::(-e:1):   1
  DB<1> use Try::Tiny;

  DB<2> try { die 'foo' } catch { print "caught: $_\n" ; next };
caught: foo at (eval 8)[/usr/lib/perl5/5.36.0/perl5db.pl:742] line 2.

  DB<3> for $i (1..4) { try { die "foo: $i" } catch { print "caught: $_\n" ; next }; }
caught: foo: 1 at (eval 9)[/usr/lib/perl5/5.36.0/perl5db.pl:742] line 2.
caught: foo: 2 at (eval 9)[/usr/lib/perl5/5.36.0/perl5db.pl:742] line 2.
caught: foo: 3 at (eval 9)[/usr/lib/perl5/5.36.0/perl5db.pl:742] line 2.
caught: foo: 4 at (eval 9)[/usr/lib/perl5/5.36.0/perl5db.pl:742] line 2.
Actions #15

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

All PRs have been merged. I think that's good enough for now. The restarting itself has been fixed and the handling of networks has been improved as well. I've checked jobs in production but found no further examples. I suppose testing this via unit tests must suffice; the same goes for the problem with allocating networks.

Actions #16

Updated by okurz 3 months ago

  • Due date deleted (2023-05-10)