action #42986

parallel jobs not reliable killed/restarted

Added by coolo over 1 year ago. Updated about 1 year ago.

Status:ResolvedStart date:28/10/2018
Priority:NormalDue date:
Assignee:coolo% Done:

0%

Category:Concrete Bugs
Target version:Current Sprint
Difficulty:
Duration:

Description

https://openqa.suse.de/tests/2219200#dependencies

rsyslog-client (https://openqa.suse.de/tests/2216925) is cancelled, but restarted as https://openqa.suse.de/tests/2219198, which failed
rsyslog-server (https://openqa.suse.de/tests/2216935) is failed, but restarted as https://openqa.suse.de/tests/2219199, which is cancelled/skipped (got skipped before 2216947 has been canceled manually)
rsyslog-controller (https://openqa.suse.de/tests/2216947) is canclled, but restarted as https://openqa.suse.de/tests/2219200, which stayed scheduled (it is right now canceled because @coolo canceled it manually)

I assume this is due to scheduler and/or cloning not working reliably in transactions. But it will be a puzzle to sort out


Related issues

Related to openQA Project - action #47117: [epic] Fix worker->websocket->scheduler->webui connection Resolved 04/02/2019

History

#1 Updated by coolo over 1 year ago

there were 7 scheduled cluster jobs in total - and all were 2 days old. So it's possible some worker restarted and created this mess.

#2 Updated by mkittler over 1 year ago

  • Assignee set to mkittler

Right now we have jobs which have been in scheduled state for months, eg. https://openqa.suse.de/tests/2278614. Some scheduled jobs even have been cloned, eg. https://openqa.suse.de/tests/2307004.

I would not delete those jobs from OSD right now because they might be helpful to understand the cause.

#3 Updated by mkittler over 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Target version changed from Ready to Current Sprint

Seems like https://openqa.suse.de/tests/2216947 has been restarted manually by a user which also caused the other restarts in the cluster:

openqa=# select event, event_data, users.nickname from audit_events join users on user_id = users.id where event_data like '%2216947%';
    event    |                                    event_data                                     |    nickname    
-------------+-----------------------------------------------------------------------------------+----------------
 job_create  | {"id":2216947}                                                                    | sle-qam-openqa
 job_restart | {"result":{"2216947":2219200,"2216925":2219198,"2216935":2219199},"id":"2216947"} | osukup
(2 Zeilen)

So that the restarts happened is not a bug. However, the jobs being stuck in the scheduled state which we still see on OSD right now are. Not sure 2219199 has been cancelled.

#4 Updated by mkittler over 1 year ago

  • Description updated (diff)

#5 Updated by mkittler over 1 year ago

Some of the jobs which are currently stuck in scheduled state are not part of the problem. For instance https://openqa.suse.de/tests/2300326#dependencies and https://openqa.suse.de/tests/2300326#dependencies are simply stuck because they require the worker class 64bit-mlx_con5 but only worker with class 64bit-mlx_con5_DISABLED are available on OSD. It seems that somebody intentionally postponed the execution of these jobs.

#6 Updated by mkittler over 1 year ago

The problem that one job in the parallel cluster is stuck in scheduled state (or get canceled/skipped) can be observed also without previous restart. So likely the problem lies in the scheduling and not the restart. For instance slenkins-tests-git-control@64bit is one of those tests which sometimes are stuck in scheduled state (https://openqa.suse.de/tests/2353721, https://openqa.suse.de/tests/2351382, https://openqa.suse.de/tests/2349564). But it is a sporadic issue, the test also gets executed quite often and then passes (https://openqa.suse.de/tests/2357039). Speaks for a bug in the scheduler causing clusters only partially scheduled.

#7 Updated by mkittler over 1 year ago

Theory: The job which is stuck in scheduled state is excluded by OpenQA::Scheduler::Scheduler::update_scheduled_jobs because a Gru task it depends on is still running. That function calls $job->cluster_jobs to compute the cluster considering all jobs. But if I read the code correctly it will in the end nevertheless only consider those jobs without a Gru task.

#8 Updated by coolo about 1 year ago

  • Assignee changed from mkittler to coolo

I killed all old jobs and increased debug level - and we caught another one indeed:

https://openqa.suse.de/tests/2364271 failed, set https://openqa.suse.de/tests/2364272 to parallel_failed but https://openqa.suse.de/tests/2364326 stayed scheduled

#9 Updated by coolo about 1 year ago

[2019-01-10T02:11:14.0502 CET] [debug] [pid:7423] Allocated: { job => 2364271, worker => 575 }
[2019-01-10T02:11:14.0501 CET] [debug] [pid:7423] Allocated: { job => 2364272, worker => 595 }
[2019-01-10T02:11:14.0504 CET] [debug] [pid:7423] Allocated: { job => 2364326, worker => 896 }

So that part still worked

#10 Updated by coolo about 1 year ago

But https://openqa.suse.de/admin/workers/896 reported the last job 29 days ago - it looks broken.

#11 Updated by coolo about 1 year ago

That worker host in general is troubled for a while it seems:

Jan 10 07:25:16 openqaworker3 worker[21681]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?
Jan 10 07:25:17 openqaworker3 worker[21679]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?

#12 Updated by coolo about 1 year ago

So the next thing that happened, was:

[2019-01-10T02:11:12.0381 CET] [debug] [pid:4246] message sent to 896 for job 2364326
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(896)
[2019-01-10T02:11:54.0127 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state

and then it was assigned to 633, which also returned the favor:

[2019-01-10T02:15:39.0730 CET] [debug] [pid:4246] message sent to 633 for job 2364326
[2019-01-10T02:16:50.0150 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(633)
[2019-01-10T02:16:50.0176 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state

So the job was 'assigned' twice - and the parallel_failed should have happened here:

[2019-01-10T02:16:20.0219 CET] [debug] POST "/api/v1/jobs/2364271/set_done" (9e769989)

And at that time the job was assigned to 633 - and once 633 gave up, the job was 'reset to scheduled'. And that's what I suspected from the beginning.

#13 Updated by coolo about 1 year ago

The problem is that there are far more bugs in play here:

Jan 10 02:16:20 openqaworker3 worker[21693]: [warn] Received command cancel for different job id 2364326 (our 2363137). Ignoring!

Why did worker3:10 got 326 assigned if it wasn't free? Why is worker3:1 dead for a month without us noticing, while worker3:10 is fine? Why did the scheduler not realize the cluster is incomplete?

#14 Updated by mkravec about 1 year ago

Jan 10 07:25:16 openqaworker3 worker[21681]: [error] Unable to upgrade connection for host "openqa.suse.de" to WebSocket: 500. Apache modules proxy_wstunnel and rewrite enabled?

I noticed those messages on my remote worker too. Worker can't reconnect (Unable to upgrade connection..) after "systemctl restart openqa-worker@1" until I reboot server with webui.
No idea what's the cause.

#15 Updated by coolo about 1 year ago

the full story is even more complex:

[2019-01-10T02:11:54.0080 CET] [debug] [pid:4246] Received from worker "896" worker_status message "$VAR1 = {
[2019-01-10T02:11:54.0083 CET] [debug] [pid:4246] Updated worker seen from worker_status
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Found Job(2364326) in DB from worker_status update sent by Worker(896)
[2019-01-10T02:11:54.0096 CET] [debug] [pid:4246] Received request has id: 2326956
[2019-01-10T02:11:54.0110 CET] [debug] [pid:4246] Worker(896) for Job(2364326) has token u20HTTwQ1lLTrpi2
[2019-01-10T02:11:54.0110 CET] [debug] [pid:4246] Received request has token: BWDiHETB9O3uLSLD
[2019-01-10T02:11:54.0112 CET] [debug] [pid:4246] Possibly worker 896 should be freed.
[2019-01-10T02:11:54.0127 CET] [debug] [pid:4246] Job '2364326' reset to 'scheduled' state

So worker said it's working on 2326956, websocket said 'WTF you should work on 2364326', canceled the job and the worker said 'hey, why? I work on 2363137'.

Very confused workers :(

#16 Updated by coolo about 1 year ago

  • Status changed from In Progress to Feedback

This is potentially the fix, but I miss the time to verify with test cases (as several components are involved)

https://github.com/os-autoinst/openQA/pull/1951

#17 Updated by coolo about 1 year ago

I really thought hard on the cycling tour home how to avoid this mess - and our options are really limited, but I guess we should try :(

#18 Updated by coolo about 1 year ago

  • Status changed from Feedback to Resolved

doesn't appear anymore

#19 Updated by coolo about 1 year ago

  • Related to action #47117: [epic] Fix worker->websocket->scheduler->webui connection added

Also available in: Atom PDF