Project

General

Profile

Actions

action #20378

closed

[tools]Too many 502 on openqa

Added by coolo over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
2017-07-18
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Main purpose of this ticket is to track progress


Subtasks 8 (0 open8 closed)

openQA Project (public) - action #20544: [tools] Research/investigate ways to optimize scheduler grab_jobResolvedEDiGiacinto2017-07-18

Actions
openQA Project (public) - action #20546: [tools] Research/investigate ways to optimize how workers picks jobsResolvedEDiGiacinto2017-07-18

Actions
openQA Project (public) - action #23320: [tools][sprint 201709.2][sprint 201710.1][sprint 201710.2] move locks/mutexes/barriers/job restarts out of schedulerResolvedEDiGiacinto2017-10-11

Actions
openQA Project (public) - action #25930: [tools][sprint 201710.1][sprint 201710.2] Split fullstack tests in different build matrixResolvedEDiGiacinto2017-10-11

Actions
openQA Project (public) - action #25936: [tools][sprint 201710.1][sprint 201710.2] Regression in OpenQA::ServerStartup - even if logfile is supplied, messages are printed in journalResolvedEDiGiacinto2017-10-11

Actions
openQA Project (public) - action #25944: [tools][sprint 201710.1][sprint 201710.2] Make job update take into account of ASSIGNED and SETUP stateResolvedEDiGiacinto2017-10-11

Actions
openQA Project (public) - action #25952: [tools][sprint 201710.1][sprint 201710.2] Reset Job state in Worker also when we loose current_host (due to persistent failures while contacting webserver)ResolvedEDiGiacinto2017-10-11

Actions
openQA Project (public) - action #25954: [tools][sprint 201710.1][sprint 201710.2] Regression: no live streamResolvedEDiGiacinto2017-10-11

Actions

Related issues 10 (0 open10 closed)

Related to openQA Project (public) - action #21006: Very high latency in interactive mode under high loadResolvedEDiGiacinto2017-08-02

Actions
Related to openQA Project (public) - action #19924: job was killed during manual investigation / interactive modeResolvedEDiGiacinto2017-06-20

Actions
Related to openQA Project (public) - action #19332: Growing number of fails due to 'skipped' modulesResolvedEDiGiacinto2017-05-23

Actions
Related to openQA Project (public) - action #10940: openqa.suse.de: repeating error in log "Query returned more than one row"ResolvedEDiGiacinto2016-02-26

Actions
Related to openQA Project (public) - action #17662: [tools]webUI does not load thumbnails for test details in time or at allResolvedEDiGiacinto2017-03-10

Actions
Related to openQA Project (public) - action #17116: [tools]Needle editor displays black screenResolvedEDiGiacinto2017-02-16

Actions
Related to openQA Project (public) - action #21028: Some jobs immediately fail with a 404 (new scheduling logic)Resolved2017-08-03

Actions
Related to openQA Project (public) - action #21818: Two workers can work on the same job at a timeResolvedEDiGiacinto2017-08-07

Actions
Related to openQA Project (public) - action #23378: [tools][sprint 201709.1] Enhancement and cleanup of "assigned" stateResolvedEDiGiacinto2017-08-14

Actions
Has duplicate openQA Project (public) - action #20598: [tools] Upload logs fails because reached maximum attempts limit Rejected2017-07-19

Actions
Actions #1

Updated by coolo over 7 years ago

we have tons of 502 all over the day - and our CPU load is fine, so it needs to be somewhere else.

2620:113:80c0:8080:2247:47ff:fee9:7c86 - - [11/Jul/2017:12:29:15 +0200] "GET /tests/1056880/status HTTP/1.1" 502

Looking at what the workers worked on, there is an evil pattern.

[Tue Jul 11 12:12:26 2017] [22001:debug] GET "/tests/1056880/livelog"
[Tue Jul 11 12:13:10 2017] [21427:debug] Stopping worker 22001 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 22001 stopped


[Tue Jul 11 12:12:26 2017] [23263:debug] GET "/tests/1056880/streaming"
[Tue Jul 11 12:13:31 2017] [21427:debug] Stopping worker 23263 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 23263 stopped


[Tue Jul 11 12:21:35 2017] [27304:debug] GET "/tests/1056880/livelog"
[Tue Jul 11 12:24:07 2017] [21427:debug] Stopping worker 27304 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 27304 stopped


[Tue Jul 11 12:21:35 2017] [27407:debug] GET "/tests/1056880/streaming"
[Tue Jul 11 12:23:46 2017] [21427:debug] Stopping worker 27407 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 27407 stopped

So one live view of mnovak 'killed' 4/20 workers.

But another worker was fine and appearing idle:

[Tue Jul 11 12:29:13 2017] [28236:debug] dispatching IPC job_grab to scheduler: [
[Tue Jul 11 12:29:16 2017] [28236:debug] IPC finished
[Tue Jul 11 12:29:16 2017] [28236:debug] 200 OK (2.934915s, 0.341/s)
[Tue Jul 11 12:29:25 2017] [28236:debug] POST "/api/v1/workers/627/grab_job"

The job_grab took 3 second - and most likely blocking, but after that it appears to be free.

A similiar pattern here:

[Tue Jul 11 12:29:16 2017] [28398:debug] dispatching IPC job_grab to scheduler: [
[Tue Jul 11 12:29:18 2017] [28398:debug] IPC finished
[Tue Jul 11 12:29:18 2017] [28398:debug] 200 OK (2.422017s, 0.413/s)
[Tue Jul 11 12:29:25 2017] [28398:debug] POST "/api/v1/workers/143/grab_job"

So one thing is making livelog and streaming in a way that they will free the worker after some time, so it
can restart.

Actions #2

Updated by coolo over 7 years ago

Another thing is finding out why we the jobs eat so much memory that we restart the workers so often - but this might be challening in general

Actions #3

Updated by coolo over 7 years ago

There is more DoS going on on this route:


IP1 - - [11/Jul/2017:12:45:37 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP2 - - [11/Jul/2017:12:45:38 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:39 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP1 - - [11/Jul/2017:12:45:40 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:41 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:42 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:43 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:44 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP2 - - [11/Jul/2017:12:45:45 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:53 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:54 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"

Actions #4

Updated by coolo over 7 years ago

If a job post hits a rather idle farm, we see a lot of bored workers hitting badly on our webui.
To demonstrate this I enabled the 'time to serve a request in microseconds' in the apache log (remember this is 1/106)

All these requests hit apache on the same timespot, but the time it takes to deliver ranges from 0.2 to 11 seconds


[11/Jul/2017:13:41:41 +0200](179916) "POST /api/v1/workers/398/grab_job
[11/Jul/2017:13:41:41 +0200](291163) "POST /api/v1/workers/350/grab_job
[11/Jul/2017:13:41:41 +0200](377814) "POST /api/v1/workers/473/grab_job
[11/Jul/2017:13:41:41 +0200](407574) "POST /api/v1/workers/259/grab_job
[11/Jul/2017:13:41:41 +0200](729731) "POST /api/v1/workers/386/grab_job
[11/Jul/2017:13:41:41 +0200](811908) "POST /api/v1/workers/622/grab_job
[11/Jul/2017:13:41:41 +0200](1128412) "POST /api/v1/workers/584/grab_job
[11/Jul/2017:13:41:41 +0200](1245856) "POST /api/v1/workers/378/grab_job
[11/Jul/2017:13:41:41 +0200](1331220) "POST /api/v1/workers/623/grab_job
[11/Jul/2017:13:41:41 +0200](1605371) "POST /api/v1/workers/408/grab_job
[11/Jul/2017:13:41:41 +0200](1618153) "POST /api/v1/workers/510/grab_job
[11/Jul/2017:13:41:41 +0200](1797409) "POST /api/v1/workers/412/grab_job
[11/Jul/2017:13:41:41 +0200](2161480) "POST /api/v1/workers/240/grab_job
[11/Jul/2017:13:41:41 +0200](1997457) "POST /api/v1/workers/389/grab_job
[11/Jul/2017:13:41:41 +0200](2869968) "POST /api/v1/workers/232/grab_job
[11/Jul/2017:13:41:41 +0200](2209118) "POST /api/v1/workers/509/grab_job
[11/Jul/2017:13:41:41 +0200](2767799) "POST /api/v1/workers/400/grab_job
[11/Jul/2017:13:41:41 +0200](2812039) "POST /api/v1/workers/159/grab_job
[11/Jul/2017:13:41:41 +0200](2613062) "POST /api/v1/workers/352/grab_job
[11/Jul/2017:13:41:41 +0200](4483598) "POST /api/v1/workers/612/grab_job
[11/Jul/2017:13:41:41 +0200](4299063) "POST /api/v1/workers/633/grab_job
[11/Jul/2017:13:41:41 +0200](4262191) "POST /api/v1/workers/403/grab_job
[11/Jul/2017:13:41:41 +0200](3987807) "POST /api/v1/workers/420/grab_job
[11/Jul/2017:13:41:41 +0200](4633743) "POST /api/v1/workers/611/grab_job
[11/Jul/2017:13:41:41 +0200](4151166) "POST /api/v1/workers/507/grab_job
[11/Jul/2017:13:41:41 +0200](4551526) "POST /api/v1/workers/359/grab_job
[11/Jul/2017:13:41:41 +0200](5182797) "POST /api/v1/workers/360/grab_job
[11/Jul/2017:13:41:41 +0200](5214027) "POST /api/v1/workers/390/grab_job
[11/Jul/2017:13:41:41 +0200](4699995) "POST /api/v1/workers/573/grab_job
[11/Jul/2017:13:41:41 +0200](4765505) "POST /api/v1/workers/591/grab_job
[11/Jul/2017:13:41:41 +0200](4989254) "POST /api/v1/workers/597/grab_job
[11/Jul/2017:13:41:41 +0200](5435068) "POST /api/v1/workers/628/grab_job
[11/Jul/2017:13:41:41 +0200](4772055) "POST /api/v1/workers/266/grab_job
[11/Jul/2017:13:41:41 +0200](6111164) "POST /api/v1/workers/383/grab_job
[11/Jul/2017:13:41:41 +0200](6031920) "POST /api/v1/workers/251/grab_job
[11/Jul/2017:13:41:41 +0200](6031181) "POST /api/v1/workers/393/grab_job
[11/Jul/2017:13:41:41 +0200](6319021) "POST /api/v1/workers/618/grab_job
[11/Jul/2017:13:41:42 +0200](5863523) "POST /api/v1/workers/512/grab_job
[11/Jul/2017:13:41:41 +0200](6281511) "POST /api/v1/workers/388/grab_job
[11/Jul/2017:13:41:41 +0200](6891344) "POST /api/v1/workers/261/grab_job
[11/Jul/2017:13:41:41 +0200](6853709) "POST /api/v1/workers/602/grab_job
[11/Jul/2017:13:41:41 +0200](6683810) "POST /api/v1/workers/596/grab_job
[11/Jul/2017:13:41:41 +0200](7221353) "POST /api/v1/workers/370/grab_job
[11/Jul/2017:13:41:42 +0200](6630409) "POST /api/v1/workers/629/grab_job
[11/Jul/2017:13:41:41 +0200](7177194) "POST /api/v1/workers/271/grab_job
[11/Jul/2017:13:41:41 +0200](7535749) "POST /api/v1/workers/161/grab_job
[11/Jul/2017:13:41:41 +0200](7205531) "POST /api/v1/workers/603/grab_job
[11/Jul/2017:13:41:41 +0200](7232579) "POST /api/v1/workers/404/grab_job
[11/Jul/2017:13:41:41 +0200](6954727) "POST /api/v1/workers/144/grab_job
[11/Jul/2017:13:41:41 +0200](8040160) "POST /api/v1/workers/233/grab_job
[11/Jul/2017:13:41:42 +0200](7601566) "POST /api/v1/workers/160/grab_job
[11/Jul/2017:13:41:41 +0200](7713551) "POST /api/v1/workers/231/grab_job
[11/Jul/2017:13:41:41 +0200](7802152) "POST /api/v1/workers/355/grab_job
[11/Jul/2017:13:41:41 +0200](7719025) "POST /api/v1/workers/372/grab_job
[11/Jul/2017:13:41:41 +0200](7768048) "POST /api/v1/workers/570/grab_job
[11/Jul/2017:13:41:41 +0200](7820518) "POST /api/v1/workers/363/grab_job
[11/Jul/2017:13:41:41 +0200](8218893) "POST /api/v1/workers/230/grab_job
[11/Jul/2017:13:41:41 +0200](8484057) "POST /api/v1/workers/362/grab_job
[11/Jul/2017:13:41:41 +0200](8472072) "POST /api/v1/workers/392/grab_job
[11/Jul/2017:13:41:41 +0200](8467901) "POST /api/v1/workers/606/grab_job
[11/Jul/2017:13:41:42 +0200](8044246) "POST /api/v1/workers/382/grab_job
[11/Jul/2017:13:41:42 +0200](8038982) "POST /api/v1/workers/361/grab_job
[11/Jul/2017:13:41:41 +0200](8754835) "POST /api/v1/workers/601/grab_job
[11/Jul/2017:13:41:42 +0200](8218667) "POST /api/v1/workers/395/grab_job
[11/Jul/2017:13:41:41 +0200](8335812) "POST /api/v1/workers/574/grab_job
[11/Jul/2017:13:41:41 +0200](9181004) "POST /api/v1/workers/600/grab_job
[11/Jul/2017:13:41:41 +0200](9011104) "POST /api/v1/workers/377/grab_job
[11/Jul/2017:13:41:41 +0200](8919971) "POST /api/v1/workers/566/grab_job
[11/Jul/2017:13:41:41 +0200](8940600) "POST /api/v1/workers/153/grab_job
[11/Jul/2017:13:41:41 +0200](9072331) "POST /api/v1/workers/624/grab_job
[11/Jul/2017:13:41:41 +0200](8924753) "POST /api/v1/workers/396/grab_job
[11/Jul/2017:13:41:41 +0200](9023842) "POST /api/v1/workers/375/grab_job
[11/Jul/2017:13:41:41 +0200](9433790) "POST /api/v1/workers/613/grab_job
[11/Jul/2017:13:41:41 +0200](9001160) "POST /api/v1/workers/592/grab_job
[11/Jul/2017:13:41:42 +0200](8670110) "POST /api/v1/workers/76/grab_job
[11/Jul/2017:13:41:41 +0200](9042698) "POST /api/v1/workers/583/grab_job
[11/Jul/2017:13:41:41 +0200](8990323) "POST /api/v1/workers/82/grab_job
[11/Jul/2017:13:41:41 +0200](8905705) "POST /api/v1/workers/152/grab_job
[11/Jul/2017:13:41:41 +0200](8921303) "POST /api/v1/workers/415/grab_job
[11/Jul/2017:13:41:41 +0200](9372029) "POST /api/v1/workers/587/grab_job
[11/Jul/2017:13:41:41 +0200](9350322) "POST /api/v1/workers/401/grab_job
[11/Jul/2017:13:41:41 +0200](9783477) "POST /api/v1/workers/590/grab_job
[11/Jul/2017:13:41:42 +0200](9017462) "POST /api/v1/workers/568/grab_job
[11/Jul/2017:13:41:42 +0200](9021850) "POST /api/v1/workers/630/grab_job
[11/Jul/2017:13:41:41 +0200](9161160) "POST /api/v1/workers/381/grab_job
[11/Jul/2017:13:41:41 +0200](9468026) "POST /api/v1/workers/387/grab_job
[11/Jul/2017:13:41:42 +0200](9137260) "POST /api/v1/workers/505/grab_job
[11/Jul/2017:13:41:41 +0200](9226341) "POST /api/v1/workers/571/grab_job
[11/Jul/2017:13:41:42 +0200](9102841) "POST /api/v1/workers/410/grab_job
[11/Jul/2017:13:41:41 +0200](10251179) "POST /api/v1/workers/247/grab_job
[11/Jul/2017:13:41:41 +0200](10582613) "POST /api/v1/workers/411/grab_job
[11/Jul/2017:13:41:41 +0200](10639507) "POST /api/v1/workers/141/grab_job
[11/Jul/2017:13:41:41 +0200](10633306) "POST /api/v1/workers/356/grab_job
[11/Jul/2017:13:41:41 +0200](10627713) "POST /api/v1/workers/397/grab_job
[11/Jul/2017:13:41:41 +0200](10552107) "POST /api/v1/workers/354/grab_job
[11/Jul/2017:13:41:41 +0200](10574927) "POST /api/v1/workers/315/grab_job
[11/Jul/2017:13:41:41 +0200](10645949) "POST /api/v1/workers/524/grab_job
[11/Jul/2017:13:41:42 +0200](10330248) "POST /api/v1/workers/380/grab_job
[11/Jul/2017:13:41:41 +0200](10641464) "POST /api/v1/workers/626/grab_job
[11/Jul/2017:13:41:42 +0200](10336100) "POST /api/v1/workers/140/grab_job
[11/Jul/2017:13:41:41 +0200](10776945) "POST /api/v1/workers/409/grab_job
[11/Jul/2017:13:41:41 +0200](10554377) "POST /api/v1/workers/414/grab_job
[11/Jul/2017:13:41:41 +0200](10813010) "POST /api/v1/workers/515/grab_job
[11/Jul/2017:13:41:41 +0200](10754378) "POST /api/v1/workers/369/grab_job
[11/Jul/2017:13:41:41 +0200](10624130) "POST /api/v1/workers/368/grab_job
[11/Jul/2017:13:41:41 +0200](10751225) "POST /api/v1/workers/250/grab_job
[11/Jul/2017:13:41:41 +0200](10780927) "POST /api/v1/workers/73/grab_job
[11/Jul/2017:13:41:41 +0200](10824351) "POST /api/v1/workers/371/grab_job
[11/Jul/2017:13:41:41 +0200](10636071) "POST /api/v1/workers/79/grab_job
[11/Jul/2017:13:41:41 +0200](11038659) "POST /api/v1/workers/586/grab_job
[11/Jul/2017:13:41:41 +0200](10987441) "POST /api/v1/workers/367/grab_job
[11/Jul/2017:13:41:41 +0200](10962790) "POST /api/v1/workers/349/grab_job
[11/Jul/2017:13:41:41 +0200](11029755) "POST /api/v1/workers/413/grab_job
[11/Jul/2017:13:41:41 +0200](11000528) "POST /api/v1/workers/374/grab_job
[11/Jul/2017:13:41:41 +0200](10890835) "POST /api/v1/workers/579/grab_job
[11/Jul/2017:13:41:41 +0200](11156567) "POST /api/v1/workers/619/grab_job
[11/Jul/2017:13:41:41 +0200](11187779) "POST /api/v1/workers/625/grab_job
[11/Jul/2017:13:41:41 +0200](11050256) "POST /api/v1/workers/418/grab_job
[11/Jul/2017:13:41:41 +0200](10653684) "POST /api/v1/workers/149/grab_job

If you combine this with dead workers, you run into the timeout of - I guess 60 seconds - quite easily.

Actions #5

Updated by coolo over 7 years ago

dead workers above means mojo workers not openqa workers

Actions #6

Updated by szarate over 7 years ago

Time to profile grab_job? or maybe have the scheduler pre allocate the jobs or add them to a table when a job can be actually scheduled (i.e, has no dependencies), refreshed every x amount of time (and support both scenarios, because on smaller deployments this would not happen :P)

Actions #7

Updated by okurz over 7 years ago

  • Subject changed from Too many 502 on openqa to [tools]Too many 502 on openqa
Actions #8

Updated by coolo over 7 years ago

  • Has duplicate action #20598: [tools] Upload logs fails because reached maximum attempts limit added
Actions #9

Updated by coolo over 7 years ago

  • Status changed from New to Resolved

we found most of the magic - and created tickets for it.

Actions #10

Updated by EDiGiacinto over 7 years ago

  • Related to action #21006: Very high latency in interactive mode under high load added
Actions #11

Updated by EDiGiacinto over 7 years ago

  • Related to action #19924: job was killed during manual investigation / interactive mode added
Actions #12

Updated by EDiGiacinto over 7 years ago

  • Related to action #19332: Growing number of fails due to 'skipped' modules added
Actions #13

Updated by EDiGiacinto over 7 years ago

  • Related to action #10940: openqa.suse.de: repeating error in log "Query returned more than one row" added
Actions #14

Updated by EDiGiacinto over 7 years ago

  • Related to action #17662: [tools]webUI does not load thumbnails for test details in time or at all added
Actions #15

Updated by EDiGiacinto over 7 years ago

  • Related to action #17116: [tools]Needle editor displays black screen added
Actions #16

Updated by EDiGiacinto over 7 years ago

  • Related to action #21028: Some jobs immediately fail with a 404 (new scheduling logic) added
Actions #17

Updated by EDiGiacinto over 7 years ago

  • Related to action #21818: Two workers can work on the same job at a time added
Actions #18

Updated by okurz over 7 years ago

  • Related to action #23378: [tools][sprint 201709.1] Enhancement and cleanup of "assigned" state added
Actions #19

Updated by EDiGiacinto about 7 years ago

  • Due date set to 2017-10-11

due to changes in a related task

Actions

Also available in: Atom PDF