action #20378
closed[tools]Too many 502 on openqa
100%
Description
Main purpose of this ticket is to track progress
Updated by coolo about 7 years ago
we have tons of 502 all over the day - and our CPU load is fine, so it needs to be somewhere else.
2620:113:80c0:8080:2247:47ff:fee9:7c86 - - [11/Jul/2017:12:29:15 +0200] "GET /tests/1056880/status HTTP/1.1" 502
Looking at what the workers worked on, there is an evil pattern.
[Tue Jul 11 12:12:26 2017] [22001:debug] GET "/tests/1056880/livelog"
[Tue Jul 11 12:13:10 2017] [21427:debug] Stopping worker 22001 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 22001 stopped
[Tue Jul 11 12:12:26 2017] [23263:debug] GET "/tests/1056880/streaming"
[Tue Jul 11 12:13:31 2017] [21427:debug] Stopping worker 23263 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 23263 stopped
[Tue Jul 11 12:21:35 2017] [27304:debug] GET "/tests/1056880/livelog"
[Tue Jul 11 12:24:07 2017] [21427:debug] Stopping worker 27304 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 27304 stopped
[Tue Jul 11 12:21:35 2017] [27407:debug] GET "/tests/1056880/streaming"
[Tue Jul 11 12:23:46 2017] [21427:debug] Stopping worker 27407 gracefully
-> dead until
[Tue Jul 11 12:26:04 2017] [21427:debug] Worker 27407 stopped
So one live view of mnovak 'killed' 4/20 workers.
But another worker was fine and appearing idle:
[Tue Jul 11 12:29:13 2017] [28236:debug] dispatching IPC job_grab to scheduler: [
[Tue Jul 11 12:29:16 2017] [28236:debug] IPC finished
[Tue Jul 11 12:29:16 2017] [28236:debug] 200 OK (2.934915s, 0.341/s)
[Tue Jul 11 12:29:25 2017] [28236:debug] POST "/api/v1/workers/627/grab_job"
The job_grab took 3 second - and most likely blocking, but after that it appears to be free.
A similiar pattern here:
[Tue Jul 11 12:29:16 2017] [28398:debug] dispatching IPC job_grab to scheduler: [
[Tue Jul 11 12:29:18 2017] [28398:debug] IPC finished
[Tue Jul 11 12:29:18 2017] [28398:debug] 200 OK (2.422017s, 0.413/s)
[Tue Jul 11 12:29:25 2017] [28398:debug] POST "/api/v1/workers/143/grab_job"
So one thing is making livelog and streaming in a way that they will free the worker after some time, so it
can restart.
Updated by coolo about 7 years ago
Another thing is finding out why we the jobs eat so much memory that we restart the workers so often - but this might be challening in general
Updated by coolo about 7 years ago
There is more DoS going on on this route:
IP1 - - [11/Jul/2017:12:45:37 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP2 - - [11/Jul/2017:12:45:38 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:39 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP1 - - [11/Jul/2017:12:45:40 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:41 +0200] "GET /tests/1056880/status HTTP/1.1" 200 103 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:42 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:43 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP1 - - [11/Jul/2017:12:45:44 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
IP2 - - [11/Jul/2017:12:45:45 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:53 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
IP2 - - [11/Jul/2017:12:45:54 +0200] "GET /tests/1056880/status HTTP/1.1" 200 16 "https://openqa.suse.de/tests/1056880" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
Updated by coolo about 7 years ago
If a job post hits a rather idle farm, we see a lot of bored workers hitting badly on our webui.
To demonstrate this I enabled the 'time to serve a request in microseconds' in the apache log (remember this is 1/106)
All these requests hit apache on the same timespot, but the time it takes to deliver ranges from 0.2 to 11 seconds
[11/Jul/2017:13:41:41 +0200](179916) "POST /api/v1/workers/398/grab_job
[11/Jul/2017:13:41:41 +0200](291163) "POST /api/v1/workers/350/grab_job
[11/Jul/2017:13:41:41 +0200](377814) "POST /api/v1/workers/473/grab_job
[11/Jul/2017:13:41:41 +0200](407574) "POST /api/v1/workers/259/grab_job
[11/Jul/2017:13:41:41 +0200](729731) "POST /api/v1/workers/386/grab_job
[11/Jul/2017:13:41:41 +0200](811908) "POST /api/v1/workers/622/grab_job
[11/Jul/2017:13:41:41 +0200](1128412) "POST /api/v1/workers/584/grab_job
[11/Jul/2017:13:41:41 +0200](1245856) "POST /api/v1/workers/378/grab_job
[11/Jul/2017:13:41:41 +0200](1331220) "POST /api/v1/workers/623/grab_job
[11/Jul/2017:13:41:41 +0200](1605371) "POST /api/v1/workers/408/grab_job
[11/Jul/2017:13:41:41 +0200](1618153) "POST /api/v1/workers/510/grab_job
[11/Jul/2017:13:41:41 +0200](1797409) "POST /api/v1/workers/412/grab_job
[11/Jul/2017:13:41:41 +0200](2161480) "POST /api/v1/workers/240/grab_job
[11/Jul/2017:13:41:41 +0200](1997457) "POST /api/v1/workers/389/grab_job
[11/Jul/2017:13:41:41 +0200](2869968) "POST /api/v1/workers/232/grab_job
[11/Jul/2017:13:41:41 +0200](2209118) "POST /api/v1/workers/509/grab_job
[11/Jul/2017:13:41:41 +0200](2767799) "POST /api/v1/workers/400/grab_job
[11/Jul/2017:13:41:41 +0200](2812039) "POST /api/v1/workers/159/grab_job
[11/Jul/2017:13:41:41 +0200](2613062) "POST /api/v1/workers/352/grab_job
[11/Jul/2017:13:41:41 +0200](4483598) "POST /api/v1/workers/612/grab_job
[11/Jul/2017:13:41:41 +0200](4299063) "POST /api/v1/workers/633/grab_job
[11/Jul/2017:13:41:41 +0200](4262191) "POST /api/v1/workers/403/grab_job
[11/Jul/2017:13:41:41 +0200](3987807) "POST /api/v1/workers/420/grab_job
[11/Jul/2017:13:41:41 +0200](4633743) "POST /api/v1/workers/611/grab_job
[11/Jul/2017:13:41:41 +0200](4151166) "POST /api/v1/workers/507/grab_job
[11/Jul/2017:13:41:41 +0200](4551526) "POST /api/v1/workers/359/grab_job
[11/Jul/2017:13:41:41 +0200](5182797) "POST /api/v1/workers/360/grab_job
[11/Jul/2017:13:41:41 +0200](5214027) "POST /api/v1/workers/390/grab_job
[11/Jul/2017:13:41:41 +0200](4699995) "POST /api/v1/workers/573/grab_job
[11/Jul/2017:13:41:41 +0200](4765505) "POST /api/v1/workers/591/grab_job
[11/Jul/2017:13:41:41 +0200](4989254) "POST /api/v1/workers/597/grab_job
[11/Jul/2017:13:41:41 +0200](5435068) "POST /api/v1/workers/628/grab_job
[11/Jul/2017:13:41:41 +0200](4772055) "POST /api/v1/workers/266/grab_job
[11/Jul/2017:13:41:41 +0200](6111164) "POST /api/v1/workers/383/grab_job
[11/Jul/2017:13:41:41 +0200](6031920) "POST /api/v1/workers/251/grab_job
[11/Jul/2017:13:41:41 +0200](6031181) "POST /api/v1/workers/393/grab_job
[11/Jul/2017:13:41:41 +0200](6319021) "POST /api/v1/workers/618/grab_job
[11/Jul/2017:13:41:42 +0200](5863523) "POST /api/v1/workers/512/grab_job
[11/Jul/2017:13:41:41 +0200](6281511) "POST /api/v1/workers/388/grab_job
[11/Jul/2017:13:41:41 +0200](6891344) "POST /api/v1/workers/261/grab_job
[11/Jul/2017:13:41:41 +0200](6853709) "POST /api/v1/workers/602/grab_job
[11/Jul/2017:13:41:41 +0200](6683810) "POST /api/v1/workers/596/grab_job
[11/Jul/2017:13:41:41 +0200](7221353) "POST /api/v1/workers/370/grab_job
[11/Jul/2017:13:41:42 +0200](6630409) "POST /api/v1/workers/629/grab_job
[11/Jul/2017:13:41:41 +0200](7177194) "POST /api/v1/workers/271/grab_job
[11/Jul/2017:13:41:41 +0200](7535749) "POST /api/v1/workers/161/grab_job
[11/Jul/2017:13:41:41 +0200](7205531) "POST /api/v1/workers/603/grab_job
[11/Jul/2017:13:41:41 +0200](7232579) "POST /api/v1/workers/404/grab_job
[11/Jul/2017:13:41:41 +0200](6954727) "POST /api/v1/workers/144/grab_job
[11/Jul/2017:13:41:41 +0200](8040160) "POST /api/v1/workers/233/grab_job
[11/Jul/2017:13:41:42 +0200](7601566) "POST /api/v1/workers/160/grab_job
[11/Jul/2017:13:41:41 +0200](7713551) "POST /api/v1/workers/231/grab_job
[11/Jul/2017:13:41:41 +0200](7802152) "POST /api/v1/workers/355/grab_job
[11/Jul/2017:13:41:41 +0200](7719025) "POST /api/v1/workers/372/grab_job
[11/Jul/2017:13:41:41 +0200](7768048) "POST /api/v1/workers/570/grab_job
[11/Jul/2017:13:41:41 +0200](7820518) "POST /api/v1/workers/363/grab_job
[11/Jul/2017:13:41:41 +0200](8218893) "POST /api/v1/workers/230/grab_job
[11/Jul/2017:13:41:41 +0200](8484057) "POST /api/v1/workers/362/grab_job
[11/Jul/2017:13:41:41 +0200](8472072) "POST /api/v1/workers/392/grab_job
[11/Jul/2017:13:41:41 +0200](8467901) "POST /api/v1/workers/606/grab_job
[11/Jul/2017:13:41:42 +0200](8044246) "POST /api/v1/workers/382/grab_job
[11/Jul/2017:13:41:42 +0200](8038982) "POST /api/v1/workers/361/grab_job
[11/Jul/2017:13:41:41 +0200](8754835) "POST /api/v1/workers/601/grab_job
[11/Jul/2017:13:41:42 +0200](8218667) "POST /api/v1/workers/395/grab_job
[11/Jul/2017:13:41:41 +0200](8335812) "POST /api/v1/workers/574/grab_job
[11/Jul/2017:13:41:41 +0200](9181004) "POST /api/v1/workers/600/grab_job
[11/Jul/2017:13:41:41 +0200](9011104) "POST /api/v1/workers/377/grab_job
[11/Jul/2017:13:41:41 +0200](8919971) "POST /api/v1/workers/566/grab_job
[11/Jul/2017:13:41:41 +0200](8940600) "POST /api/v1/workers/153/grab_job
[11/Jul/2017:13:41:41 +0200](9072331) "POST /api/v1/workers/624/grab_job
[11/Jul/2017:13:41:41 +0200](8924753) "POST /api/v1/workers/396/grab_job
[11/Jul/2017:13:41:41 +0200](9023842) "POST /api/v1/workers/375/grab_job
[11/Jul/2017:13:41:41 +0200](9433790) "POST /api/v1/workers/613/grab_job
[11/Jul/2017:13:41:41 +0200](9001160) "POST /api/v1/workers/592/grab_job
[11/Jul/2017:13:41:42 +0200](8670110) "POST /api/v1/workers/76/grab_job
[11/Jul/2017:13:41:41 +0200](9042698) "POST /api/v1/workers/583/grab_job
[11/Jul/2017:13:41:41 +0200](8990323) "POST /api/v1/workers/82/grab_job
[11/Jul/2017:13:41:41 +0200](8905705) "POST /api/v1/workers/152/grab_job
[11/Jul/2017:13:41:41 +0200](8921303) "POST /api/v1/workers/415/grab_job
[11/Jul/2017:13:41:41 +0200](9372029) "POST /api/v1/workers/587/grab_job
[11/Jul/2017:13:41:41 +0200](9350322) "POST /api/v1/workers/401/grab_job
[11/Jul/2017:13:41:41 +0200](9783477) "POST /api/v1/workers/590/grab_job
[11/Jul/2017:13:41:42 +0200](9017462) "POST /api/v1/workers/568/grab_job
[11/Jul/2017:13:41:42 +0200](9021850) "POST /api/v1/workers/630/grab_job
[11/Jul/2017:13:41:41 +0200](9161160) "POST /api/v1/workers/381/grab_job
[11/Jul/2017:13:41:41 +0200](9468026) "POST /api/v1/workers/387/grab_job
[11/Jul/2017:13:41:42 +0200](9137260) "POST /api/v1/workers/505/grab_job
[11/Jul/2017:13:41:41 +0200](9226341) "POST /api/v1/workers/571/grab_job
[11/Jul/2017:13:41:42 +0200](9102841) "POST /api/v1/workers/410/grab_job
[11/Jul/2017:13:41:41 +0200](10251179) "POST /api/v1/workers/247/grab_job
[11/Jul/2017:13:41:41 +0200](10582613) "POST /api/v1/workers/411/grab_job
[11/Jul/2017:13:41:41 +0200](10639507) "POST /api/v1/workers/141/grab_job
[11/Jul/2017:13:41:41 +0200](10633306) "POST /api/v1/workers/356/grab_job
[11/Jul/2017:13:41:41 +0200](10627713) "POST /api/v1/workers/397/grab_job
[11/Jul/2017:13:41:41 +0200](10552107) "POST /api/v1/workers/354/grab_job
[11/Jul/2017:13:41:41 +0200](10574927) "POST /api/v1/workers/315/grab_job
[11/Jul/2017:13:41:41 +0200](10645949) "POST /api/v1/workers/524/grab_job
[11/Jul/2017:13:41:42 +0200](10330248) "POST /api/v1/workers/380/grab_job
[11/Jul/2017:13:41:41 +0200](10641464) "POST /api/v1/workers/626/grab_job
[11/Jul/2017:13:41:42 +0200](10336100) "POST /api/v1/workers/140/grab_job
[11/Jul/2017:13:41:41 +0200](10776945) "POST /api/v1/workers/409/grab_job
[11/Jul/2017:13:41:41 +0200](10554377) "POST /api/v1/workers/414/grab_job
[11/Jul/2017:13:41:41 +0200](10813010) "POST /api/v1/workers/515/grab_job
[11/Jul/2017:13:41:41 +0200](10754378) "POST /api/v1/workers/369/grab_job
[11/Jul/2017:13:41:41 +0200](10624130) "POST /api/v1/workers/368/grab_job
[11/Jul/2017:13:41:41 +0200](10751225) "POST /api/v1/workers/250/grab_job
[11/Jul/2017:13:41:41 +0200](10780927) "POST /api/v1/workers/73/grab_job
[11/Jul/2017:13:41:41 +0200](10824351) "POST /api/v1/workers/371/grab_job
[11/Jul/2017:13:41:41 +0200](10636071) "POST /api/v1/workers/79/grab_job
[11/Jul/2017:13:41:41 +0200](11038659) "POST /api/v1/workers/586/grab_job
[11/Jul/2017:13:41:41 +0200](10987441) "POST /api/v1/workers/367/grab_job
[11/Jul/2017:13:41:41 +0200](10962790) "POST /api/v1/workers/349/grab_job
[11/Jul/2017:13:41:41 +0200](11029755) "POST /api/v1/workers/413/grab_job
[11/Jul/2017:13:41:41 +0200](11000528) "POST /api/v1/workers/374/grab_job
[11/Jul/2017:13:41:41 +0200](10890835) "POST /api/v1/workers/579/grab_job
[11/Jul/2017:13:41:41 +0200](11156567) "POST /api/v1/workers/619/grab_job
[11/Jul/2017:13:41:41 +0200](11187779) "POST /api/v1/workers/625/grab_job
[11/Jul/2017:13:41:41 +0200](11050256) "POST /api/v1/workers/418/grab_job
[11/Jul/2017:13:41:41 +0200](10653684) "POST /api/v1/workers/149/grab_job
If you combine this with dead workers, you run into the timeout of - I guess 60 seconds - quite easily.
Updated by coolo about 7 years ago
dead workers above means mojo workers not openqa workers
Updated by szarate about 7 years ago
Time to profile grab_job? or maybe have the scheduler pre allocate the jobs or add them to a table when a job can be actually scheduled (i.e, has no dependencies), refreshed every x amount of time (and support both scenarios, because on smaller deployments this would not happen :P)
Updated by okurz about 7 years ago
- Subject changed from Too many 502 on openqa to [tools]Too many 502 on openqa
Updated by coolo about 7 years ago
- Has duplicate action #20598: [tools] Upload logs fails because reached maximum attempts limit added
Updated by coolo about 7 years ago
- Status changed from New to Resolved
we found most of the magic - and created tickets for it.
Updated by EDiGiacinto about 7 years ago
- Related to action #21006: Very high latency in interactive mode under high load added
Updated by EDiGiacinto about 7 years ago
- Related to action #19924: job was killed during manual investigation / interactive mode added
Updated by EDiGiacinto about 7 years ago
- Related to action #19332: Growing number of fails due to 'skipped' modules added
Updated by EDiGiacinto about 7 years ago
- Related to action #10940: openqa.suse.de: repeating error in log "Query returned more than one row" added
Updated by EDiGiacinto about 7 years ago
- Related to action #17662: [tools]webUI does not load thumbnails for test details in time or at all added
Updated by EDiGiacinto about 7 years ago
- Related to action #17116: [tools]Needle editor displays black screen added
Updated by EDiGiacinto about 7 years ago
- Related to action #21028: Some jobs immediately fail with a 404 (new scheduling logic) added
Updated by EDiGiacinto about 7 years ago
- Related to action #21818: Two workers can work on the same job at a time added
Updated by okurz about 7 years ago
- Related to action #23378: [tools][sprint 201709.1] Enhancement and cleanup of "assigned" state added
Updated by EDiGiacinto almost 7 years ago
- Due date set to 2017-10-11
due to changes in a related task