action #62015

jobs incomplete without logs as some workers are rejected (was: Scheduler does not work)

Added by nadvornik 3 months ago. Updated 2 months ago.

Status:ResolvedStart date:10/01/2020
Priority:HighDue date:
Assignee:mkittler% Done:

0%

Category:Support
Target version:Done
Difficulty:
Duration:

Description

After update to openQA-4.6.1578604167.a65685352-2147.1 I have random problem with scheduling.

journal contains this:

Jan 10 16:22:19 sleposbuilder2 worker[18329]: [info] 28043: WORKING 70
Jan 10 16:22:19 sleposbuilder2 openqa-websockets[17154]: [warn] Worker 4 accepted job 70 which was never assigned to it or has already finished
Jan 10 16:22:19 sleposbuilder2 openqa[17156]: [info] Got status update for job 70 with unexpected worker ID 4 (expected 2, job is scheduled)
Jan 10 16:22:19 sleposbuilder2 worker[18329]: [error] REST-API error (POST http://sleposbuilder2.suse.cz/api/v1/jobs/70/status): 400 response: Got status update for job 70 with unexpected worker ID 4 (expected 2
, job is scheduled) (remaining tries: 2)

then the job fails.

Postgres log contains this:
2020-01-10 16:22:19.831 CET openqa geekotest [17396]ERROR: duplicate key value violates unique constraint "workers_job_id"
2020-01-10 16:22:19.831 CET openqa geekotest [17396]DETAIL: Key (job_id)=(70) already exists.
2020-01-10 16:22:19.831 CET openqa geekotest [17396]STATEMENT: UPDATE workers SET job_id = $1, t_updated = $2 WHERE ( id = $3 )

Sometimes the same tests work fine.

Full logs attached

logs.tgz - logs (35.4 KB) nadvornik, 10/01/2020 03:32 pm

log.gz - logs (99.5 KB) nadvornik, 13/01/2020 12:56 pm


Related issues

Related to openQA Project - action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-sta... Workable 21/06/2018
Related to openQA Project - action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep Workable 04/11/2019
Related to openQA Project - action #59984: unstable test: t/05-scheduler-full.t Resolved 18/11/2019
Related to openQA Project - action #62243: After latest updates, openQA has problematic behavior on ... Resolved 17/01/2020
Related to openQA Project - action #62984: Fix problem with job-worker assignment resulting in API e... Resolved 03/02/2020

History

#1 Updated by okurz 3 months ago

  • Category set to Support

#2 Updated by okurz 3 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

Can you state which are the previous package versions? E.g. from /var/log/zypp/history find out "last good" and "first bad" version.

You are stating you are having problems with the scheduler but the log refers to job updates. So I assume the jobs actually start up fine but fails afterwards, right?

On http://sleposbuilder2.suse.cz/tests I could not find any green jobs. Have older ones be deleted?

The latest job http://sleposbuilder2.suse.cz/tests/116 states "[2020-01-11T10:10:24.0261 CET] [error] HDD_1 handling Cannot find HDD_1 asset hdd/suma_minion-12-SP4.qcow2!" so the HDD simply does not exist (anymore?).

#3 Updated by nadvornik 3 months ago

  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Target version deleted (Current Sprint)

The previously working version was from October 24 when I installed the machine, I can't find exact version. Then I updated to various versions released during January 8-10.

I think the job runs fine on a worker, but the worker ID in db is wrong, so openQA does not accept any results from the job and sets it to failed.

One openQA version from January 8 broke the db update so I started with empty db. As a side effect, gru deleted the assets. But it is not relevant to the problem reported here. The assets will be created again when the install job succeeds.

This bug appears at jobs 70 and 71, the same test worked fine with jobs 60 and 61 (actually it run for several hours then it failed for unrelated reason).

#4 Updated by mkittler 3 months ago

Not sure yet whether wrong scheduling is the cause of this problem (but it might be).

I think the job runs fine on a worker, but the worker ID in db is wrong, so openQA does not accept any results from the job and sets it to failed.

Right, that can be concluded from the log. We don't see this problem in production or anywhere else (right?) so I assume it is a local configuration issue.

Here's another example (from your log):

Jan 10 16:27:02 sleposbuilder2 worker[20622]: [error] REST-API error (POST http://sleposbuilder2.suse.cz/api/v1/jobs/75/status): 400 response: Got status update for job 75 with unexpected worker ID 3 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Jan 10 16:27:02 sleposbuilder2 worker[20622]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20033/f9S4QkUqCvK07XB3/broadcast
Jan 10 16:27:04 sleposbuilder2 worker[20622]: [info] Isotovideo exit status: 0
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] +++ worker notes +++
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] End time: 2020-01-10 15:27:11
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] Result: api-failure
Jan 10 16:27:11 sleposbuilder2 openqa[17156]: [info] Got status update for job 75 with unexpected worker ID 3 (expected no updates anymore, job is done with result incomplete)
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [error] REST-API error (POST http://sleposbuilder2.suse.cz/api/v1/jobs/75/status): 400 response: Got status update for job 75 with unexpected worker ID 3 (expected no updates anymore, job is done with result incomplete) (remaining tries: 1)
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] Registering with openQA http://sleposbuilder2.suse.cz
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] Establishing ws connection via ws://sleposbuilder2.suse.cz/api/v1/ws/3
Jan 10 16:27:11 sleposbuilder2 openqa-websockets[17154]: [info] Worker 3 websocket connection closed - 1005
Jan 10 16:27:11 sleposbuilder2 worker[20622]: [info] Registered and connected via websockets with openQA host http://sleposbuilder2.suse.cz and worker ID 3
Jan 10 16:27:16 sleposbuilder2 openqa[17156]: [info] Got status update for job 75 with unexpected worker ID 3 (expected no updates anymore, job is done with result incomplete)
Jan 10 16:27:16 sleposbuilder2 worker[20622]: [error] REST-API error (POST http://sleposbuilder2.suse.cz/api/v1/jobs/75/status): 400 response: Got status update for job 75 with unexpected worker ID 3 (expected no updates anymore, job is done with result incomplete) (remaining tries: 0)
Jan 10 16:27:16 sleposbuilder2 worker[20622]: [error] Unable to make final image uploads. Maybe the web UI considers this job already dead.

Apparently the worker is not able to discontinue the job after fatal API errors occur and keeps spamming the web UI with old status information. The re-registration with the web UI works. At least the worker thinks ID 3 has been assigned to it (again). But then the ws connection is immediately closed again and the worker tries to re-register. That's likely because it still does not stop sending the status and runs into the same error again. Seems like the same is happening over and over again.

I would do the following to investigate the problem:

  1. Cancel all scheduled jobs or stop the scheduler.
  2. Stop all workers and just start a single one again. Having just one worker around should make reading the logs easier.
  3. Restart any other daemons as well.
  4. Without the scheduler no jobs should be running. Check whether worker IDs are matching on both sides. Check whether the worker status is displayed correctly within the web UI.
  5. If everything looks fine, schedule a single job again and see how it behaves. If a fatal failure happens, does the worker go back to idle (no more log messages occur)?

#5 Updated by okurz 3 months ago

  • Subject changed from Scheduler does not work to jobs incomplete without logs as some workers are rejected (was: Scheduler does not work)
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

looks like the updates from some workers are rejected, e.g. as seen on http://sleposbuilder2.suse.cz/admin/workers . But some tests seem to run fine, e.g. http://sleposbuilder2.suse.cz/tests/208 and http://sleposbuilder2.suse.cz/tests/209 . You stated "empty db" but I am not sure if there aren't references left which cause inconsistencies now. I suggest you stop all services and worker instances, crosscheck that there are no references to the workers in the database and start them again to register.

#6 Updated by nadvornik 3 months ago

I enabled full debug, rebooted the machine and started 1 worker.
Then the first job - ID:222 ran fine: scheduled -> running -> finished, green
Then the second job - ID:223 behaved strange - in the UI it was scheduled all the time, then it instantly changed state to finished, green

From the log it seems that the worker started executing and sending results long time before scheduler updated the worker table with UPDATE workers SET job_id = ...
(log at time Jan 13 13:38:39)

#7 Updated by nadvornik 3 months ago

Might be related to this:

https://github.com/os-autoinst/openQA/commit/4928543c86dbe204f46f559a621aafcff4c52e36#diff-00673683b9c8f6ce43ccbe117dfada0f

"Remove code for marking the current job of the worker as running
when it connects to the web socket server because on the first
status update it will be considered running anyways"

#8 Updated by nadvornik 3 months ago

As a workaround I tried to add sleep(30) before worker accept_job and now everything seems to run fine, even with multiple workers.

#9 Updated by okurz 3 months ago

  • Assignee changed from okurz to mkittler

@mkittler can you crosscheck?

#10 Updated by mkittler 3 months ago

I'm currently trying to reproduce by cloning one of your jobs. Otherwise I don't know how I would reproduce the problem. Has anybody else being able to observe that problem?

I'm also reading the mentioned commit again but right now I'm not sure whether it can cause that kind of problem. Since you can apparently reproduce the problem you could locally revert the change to see whether that helps.

#11 Updated by nadvornik 3 months ago

lkotek has seen it on his machine too.

My theory is that the scheduler has been broken for long time, but the removed code somehow masked it in the last working version from October.

I think it should be easily reproducible if you add some sleeps to the scheduler.

Regarding reverting the commit - it is quite large and I don't want to break my machine even more, unless it is really necessary.

#12 Updated by mkittler 3 months ago

Ok. I had a look at your log and picked job 223 as an example:

Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] need to schedule 1 jobs for 223(50)
...
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] [Job#223] Prepare for being processed by worker 1
...
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [199d6984] POST "/api/send_job"
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [199d6984] Routing to controller "OpenQA::Shared::Controller::Auth" and action "check"
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [199d6984] Routing to controller "OpenQA::WebSockets::Controller::API" and action "send_job"
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [pid:2686] Started to send message to 1 for job 223
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [199d6984] 200 OK (0.000946s, 1057.082/s)
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] Sent job(s) '223' to worker '1'
...
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] [DBIC] Took 0.00049000 seconds: UPDATE jobs SET assigned_worker_id = ?, state = ? WHERE ( id = ? ): '1', 'assigned', '223'
...
Jan 13 13:38:39 sleposbuilder2 worker[3057]: [info] 3322: WORKING 223
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [debug] [54e16c05] POST "/api/v1/jobs/223/status"
...
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [warn] [pid:2686] Worker 1 accepted job 223 which was never assigned to it or has already finished
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [pid:2686] Received from worker "1" worker_status message "---
...
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [debug] [54e16c05] Routing to controller "OpenQA::WebAPI::Controller::API::V1::Job" and action "update_status"
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [debug] [pid:2781] [DBIC] Took 0.00013100 seconds: SELECT me.id, me.result_dir, me.state, me.priority, me.result, me.clone_id, me.blocked_by_id, me.backend, me.backend_info, me.TEST, me.DISTRI, me.VERSION, me.FLAVOR, me.ARCH, me.BUILD, me.MACHINE, me.group_id, me.assigned_worker_id, me.t_started, me.t_finished, me.logs_present, me.passed_module_count, me.failed_module_count, me.softfailed_module_count, me.skipped_module_count, me.externally_skipped_module_count, me.scheduled_product_id, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '223'
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [debug] [pid:2781] [DBIC] Took 0.00034800 seconds: SELECT me.id, me.host, me.instance, me.job_id, me.upload_progress, me.error, me.t_created, me.t_updated FROM workers me WHERE ( me.job_id = ? ): '223'
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [info] [pid:2781] Got status update for job 223 and worker 1 but there is not even a worker assigned to this job (job is scheduled)
Jan 13 13:38:39 sleposbuilder2 openqa[2687]: [debug] [54e16c05] 400 Bad Request (0.00681s, 146.843/s)
Jan 13 13:38:39 sleposbuilder2 worker[3057]: [error] REST-API error (POST http://sleposbuilder2.suse.cz/api/v1/jobs/223/status): 400 response: Got status update for job 223 and worker 1 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 2)
...
...
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [pid:2686] Job  223 reset to state scheduled
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [pid:2686] [DBIC] Took 0.00017700 seconds: SELECT me.id, me.host, me.instance, me.job_id, me.upload_progress, me.error, me.t_created, me.t_updated FROM workers me WHERE ( me.job_id = ? ): '223'
Jan 13 13:38:39 sleposbuilder2 openqa-websockets[2686]: [debug] [pid:2686] [DBIC] COMMIT
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] [DBIC] Took 0.04916300 seconds: UPDATE workers SET job_id = ?, t_updated = ? WHERE ( id = ? ): '223', '2020-01-13 12:38:39', '1'
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] [DBIC] Took 0.00034000 seconds: SELECT me.id, me.result_dir, me.state, me.priority, me.result, me.clone_id, me.blocked_by_id, me.backend, me.backend_info, me.TEST, me.DISTRI, me.VERSION, me.FLAVOR, me.ARCH, me.BUILD, me.MACHINE, me.group_id, me.assigned_worker_id, me.t_started, me.t_finished, me.logs_present, me.passed_module_count, me.failed_module_count, me.softfailed_module_count, me.skipped_module_count, me.externally_skipped_module_count, me.scheduled_product_id, me.t_created, me.t_updated FROM jobs me WHERE ( me.id = ? ): '223'
Jan 13 13:38:39 sleposbuilder2 openqa-scheduler[2685]: [debug] [pid:2685] Job '223' has worker '1' assigned

So that's what's happening (at least for job 223):

  1. The scheduler sets the assigned_worker_id (of the job) before the job is passed to the worker. That is supposed to be sufficient. The UPDATE workers SET job_id = ... is also supposed to happen but shouldn't be required in 2..
  2. The worker picks up the job accepting it via the web socket server and sending a status update to the web UI. The log of the web socket server and the web UI shows that these services are not aware of the assignment done in 1..
  3. The other errors are just a result of 2..
  4. The websocket server sets the job back to scheduled. For the websocket server and web UI it looks like the job hasn't been touched by a worker because all attempts from the worker to communicate have been rejected in 2. and 3.. So this makes actually kind of sense.
  5. The worker hasn't given up on the job at this point.
  6. Only at this point we here from the scheduler again. It doesn't know what happened so far so it continues doing the next step of the scheduling logic which is the UPDATE workers SET job_id = ....
  7. The following worker status updates are accepted. Even though it has been set back to scheduled in 4. the update from 6. allows that.

So the culprit is 2. although it is strange that the scheduler so much time for 6.. The problem in 2. might be caused by a race condition when accessing the database. I'm wondering what happens when I remove the code which ran in 6.. Maybe then I can reproduce the issue although that should actually work as well because the web socket server and the web UI have code that does this assignment as well if they see that the assignment from 1. happened.

By the way, that's the code which runs on the scheduler:

sub set_assigned_worker {
    my ($self, $worker) = @_;
    return 0 unless $worker;

    # that's 1.
    $self->update(
        {
            state              => ASSIGNED,
            t_started          => undef,
            assigned_worker_id => $worker->id,
        });

    # that's 6. (so 2. to 5. happened in between)
    $worker->update({job_id => $self->id});

    if ($worker->job->id eq $self->id) {
        log_debug("Job '" . $self->id . "' has worker '" . $worker->id . "' assigned");
        return 1;
    }
    else {
        return 0;
    }
}

#13 Updated by mkittler 3 months ago

I suppose the problem would go away if the scheduler sets the worker-job relations within the database before sending the job to the worker. If sending the job to the worker doesn't work it can still handle the error and undo the worker-job relation again. The error handling is already in place anyways because other things need already cleanup.

I would also remove the line $worker->update({job_id => $self->id}); completely from the scheduler. As already explained the web socket server and web UI handle this as well. I've also just tested locally that it works without it.

The weird error checking if ($worker->job->id eq $self->id) { should be removed as well. DBIx would throw an exception if the update doesn't work.

Note that the issue becomes consistently reproducible by inserting a sleep 10; after return 0 unless $worker; to simulate that the scheduler is slower with updating the db table.

#15 Updated by okurz 3 months ago

  • Related to action #37638: Flaky fullstack test: 'Test 3 is scheduled' at t/full-stack.t added

#16 Updated by okurz 3 months ago

  • Related to action #59043: Fix unstable/flaky full-stack test, i.e. remove sleep added

#17 Updated by okurz 3 months ago

  • Related to action #59984: unstable test: t/05-scheduler-full.t added

#18 Updated by okurz 3 months ago

I linked three tickets that could be related, being about tests involving the scheduler and a job does not always end up in the expected state. I recommend we also trigger the formerly "unstable" tests often enough to see if the specific fix you provided changes the behaviour.

#19 Updated by okurz 3 months ago

  • Related to action #62243: After latest updates, openQA has problematic behavior on Dell Precision 5810 added

#20 Updated by mkittler 3 months ago

The PR has been merged. Since I couldn't reproduce the issue myself (without explicitly simulating a slow scheduler) it would be good if you could try out the new version and check whether the problem is actually gone.

#21 Updated by nadvornik 2 months ago

I have updated the machine, so far it works fine.

#22 Updated by okurz 2 months ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

As o3 is fine and nadvornik also reported a positive feedback I guess we can call it done. Thanks mkittler for the fast reaction and good investigation!

#23 Updated by mkittler 2 months ago

  • Related to action #62984: Fix problem with job-worker assignment resulting in API errors added

Also available in: Atom PDF