action #39068

Webui killed by out of memory in o3 (triggered by postgresql)

Added by szarate over 1 year ago. Updated over 1 year ago.

Status: Rejected
Start date: 01/08/2018
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Feature requests
Target version: Done
Difficulty:
Duration:

Description

I just noticed that the o3 webui was down. Looking at the journal, there are a lot of messages like the following:

Aug 01 16:03:35 ariel openqa[29052]: Use of uninitialized value $distri in string eq at template
Aug 01 16:03:35 ariel openqa[29052]:         branding/openSUSE/external_reporting.html.ep line 103 (#1)
Aug 01 16:07:37 ariel openqa[29052]:         (in cleanup) Can't call method "stream" on an undefined value at
Aug 01 16:07:37 ariel openqa[29052]:         /usr/lib/perl5/vendor_perl/5.18.2/Mojo/RabbitMQ/Client.pm line 544 during global destruction (#2)                                                                                       
Aug 01 16:07:37 ariel openqa[29052]:     (W misc) This prefix usually indicates that a DESTROY() method raised
Aug 01 16:07:37 ariel openqa[29052]:     the indicated exception.  Since destructors are usually called by the
Aug 01 16:07:37 ariel openqa[29052]:     system at arbitrary points during execution, and often a vast number of
Aug 01 16:07:37 ariel openqa[29052]:     times, the warning is issued only once for any number of failures that
Aug 01 16:07:37 ariel openqa[29052]:     would otherwise result in the same message being repeated.
Aug 01 16:07:37 ariel openqa[29052]:
Aug 01 16:07:37 ariel openqa[29052]:     Failure of user callbacks dispatched using the G_KEEPERR flag could
Aug 01 16:07:37 ariel openqa[29052]:     also result in this warning.  See "G_KEEPERR" in perlcall.
Aug 01 16:07:37 ariel openqa[29052]:
Aug 01 16:15:12 ariel openqa[29052]: DBIx::Class::Storage::DBI::_gen_sql_bind(): DateTime objects passed to search() are not supported properly (InflateColumn::DateTime formats and settings are not respected.) See ".. format a Dat
Aug 01 17:47:51 ariel openqa[29052]:         (in cleanup) Can't call method "stream" on an undefined value at
Aug 01 17:47:51 ariel openqa[29052]:         /usr/lib/perl5/vendor_perl/5.18.2/Mojo/RabbitMQ/Client.pm line 544 during global destruction (#1)                                                                                       
Aug 01 17:47:51 ariel openqa[29052]:     (W misc) This prefix usually indicates that a DESTROY() method raised
Aug 01 17:47:51 ariel openqa[29052]:     the indicated exception.  Since destructors are usually called by the
Aug 01 17:47:51 ariel openqa[29052]:     system at arbitrary points during execution, and often a vast number of
Aug 01 17:47:51 ariel openqa[29052]:     times, the warning is issued only once for any number of failures that
Aug 01 17:47:51 ariel openqa[29052]:     would otherwise result in the same message being repeated.
Aug 01 17:47:51 ariel openqa[29052]:
Aug 01 17:47:51 ariel openqa[29052]:     Failure of user callbacks dispatched using the G_KEEPERR flag could
Aug 01 17:47:51 ariel openqa[29052]:     also result in this warning.  See "G_KEEPERR" in perlcall.
Aug 01 17:47:51 ariel openqa[29052]:

Indeed, the openqa process was killed, and the app died because it was unable to fork. I wonder if there is a leak:

Aug 01 21:09:16 ariel openqa[29052]: Can't fork: Cannot allocate memory at
Aug 01 21:09:16 ariel openqa[29052]:         /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Server/Prefork.pm line 142 (#1)                                                                                                                  
Aug 01 21:09:16 ariel openqa[29052]:     (F) A fatal error occurred while trying to fork while opening a
Aug 01 21:09:16 ariel openqa[29052]:     pipeline.
Aug 01 21:09:16 ariel openqa[29052]:

Aug 01 21:09:16 ariel openqa[29052]: Uncaught exception from user code:
Aug 01 21:09:16 ariel openqa[29052]:         Can't fork: Cannot allocate memory at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Server/Prefork.pm line 142.                                                                                
Aug 01 21:09:16 ariel openqa[29052]:         Mojo::Server::Prefork::_spawn('Mojo::Server::Prefork=HASH(0x9caabe0)') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Server/Prefork.pm line 100                                      
Aug 01 21:09:16 ariel openqa[29052]:         Mojo::Server::Prefork::_manage('Mojo::Server::Prefork=HASH(0x9caabe0)') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Server/Prefork.pm line 85                                      
Aug 01 21:09:16 ariel openqa[29052]:         Mojo::Server::Prefork::run('Mojo::Server::Prefork=HASH(0x9caabe0)') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Command/prefork.pm line 31                                  
Aug 01 21:09:16 ariel openqa[29052]:         Mojolicious::Command::prefork::run('Mojolicious::Command::prefork=HASH(0x9cae588)', '--proxy', '-i', 100, '-H', 400, '-w', 20, '-G', ...) called at /usr/lib/perl5/vendor_perl/5.18.2/Moj
Aug 01 21:09:16 ariel openqa[29052]:         Mojolicious::Commands::run('Mojolicious::Commands=HASH(0x8a33ef0)', 'prefork', '-m', 'production', '--proxy', '-i', 100, '-H', 400, ...) called at /usr/lib/perl5/vendor_perl/5.18.2/Mojo
Aug 01 21:09:16 ariel openqa[29052]:         Mojolicious::start('OpenQA::WebAPI=HASH(0x1700280)') called at /usr/lib/perl5/vendor_perl/5.18.2/Mojolicious/Commands.pm line 71                                                        
Aug 01 21:09:16 ariel openqa[29052]:         Mojolicious::Commands::start_app('Mojolicious::Commands', 'OpenQA::WebAPI') called at /usr/share/openqa/script/../lib/OpenQA/WebAPI.pm line 486                                         
Aug 01 21:09:16 ariel openqa[29052]:         OpenQA::WebAPI::run() called at /usr/share/openqa/script/openqa line 34                                                                                                                 
Aug 01 21:09:17 ariel systemd[1]: openqa-webui.service: Main process exited, code=exited, status=12/n/a
Aug 01 21:09:18 ariel systemd[1]: openqa-webui.service: Unit entered failed state.
Aug 01 21:09:18 ariel systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
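To locate the moment the fork failed in a long journal, the lines above can be filtered mechanically. The following is a minimal sketch, assuming the journal has been exported as plain text in the same short format shown in this report; the function name `find_fork_failures` and the regular expression are illustrative and not part of openQA.

```python
import re

# Short journal format as it appears in this report:
# "Aug 01 21:09:16 ariel openqa[29052]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<unit>[\w.-]+)\[(?P<pid>\d+)\]: ?(?P<msg>.*)$"
)

FORK_ERROR = "Can't fork: Cannot allocate memory"


def find_fork_failures(lines):
    """Return (timestamp, pid) for each line whose message starts with the fork error.

    Indented repeats of the same text (for example inside the backtrace)
    are deliberately not counted, so each failure is reported once.
    """
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip())
        if m and m.group("msg").startswith(FORK_ERROR):
            hits.append((m.group("ts"), int(m.group("pid"))))
    return hits


# Sample lines copied from the journal excerpt above.
sample = [
    "Aug 01 21:09:16 ariel openqa[29052]: Can't fork: Cannot allocate memory at",
    "Aug 01 21:09:16 ariel openqa[29052]:         /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Server/Prefork.pm line 142 (#1)",
    "Aug 01 21:09:17 ariel systemd[1]: openqa-webui.service: Main process exited, code=exited, status=12/n/a",
]
failures = find_fork_failures(sample)
print(failures)
```

The exit status 12 reported by systemd is consistent with ENOMEM on Linux, which matches the "Cannot allocate memory" message, although that alone does not show which process was consuming the memory.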

dmesg_03.txt - dmesg output (166 KB) szarate, 23/08/2018 02:49 pm


Related issues

Related to openQA Project - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway ... Resolved 15/08/2018
Related to openQA Project - action #39629: openQA Scheduler refactor fallout Resolved 13/08/2018

History

#1 Updated by szarate over 1 year ago

  • Subject changed from out of memory in o3 to Webui killed by out of memory in o3

#2 Updated by szarate over 1 year ago

I think we need to revisit the actual parameters we use to start our openQA instance, as it looks like either Mojo or Apache cannot cope with them...
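As a rough aid for revisiting those parameters: the backtrace in the description shows the prefork server started with `-w 20`, which in Mojolicious's prefork command is the number of worker processes. The sketch below is only a back-of-the-envelope upper bound. The per-worker memory figure is a hypothetical placeholder, not a measurement from o3, and the helper name is made up for illustration.

```python
def estimate_prefork_rss_mib(workers, per_worker_rss_mib, manager_rss_mib=0.0):
    """Upper-bound estimate of resident memory for a prefork server, in MiB.

    Summing per-process RSS overcounts pages that forked workers still
    share copy-on-write with the manager, so treat the result as a ceiling.
    """
    return manager_rss_mib + workers * per_worker_rss_mib


WORKERS = 20  # from '-w', 20 in the backtrace
ASSUMED_WORKER_RSS_MIB = 250.0  # hypothetical value; measure on the real host

upper_bound = estimate_prefork_rss_mib(WORKERS, ASSUMED_WORKER_RSS_MIB)
print(f"up to {upper_bound:.0f} MiB for {WORKERS} workers")
```

Comparing such an estimate with the memory left over after PostgreSQL and the other services on the host would indicate whether the worker count alone could plausibly exhaust memory.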

#3 Updated by szarate over 1 year ago

  • Priority changed from Normal to Urgent
  • Target version set to Current Sprint

#4 Updated by szarate over 1 year ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added

#5 Updated by szarate over 1 year ago

  • Related to action #39629: openQA Scheduler refactor fallout added

#6 Updated by szarate over 1 year ago

By this point, o3 already had the blocked_by calculation in place, and I have not seen the OOM killer start again after commenting out the blocked_by calculation/deploying the old scheduler.

#7 Updated by szarate over 1 year ago

  • File dmesg_03.txt added
  • Priority changed from Urgent to High

This one needs investigation:

[Tue Aug 21 11:40:18 2018] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
[Tue Aug 21 11:40:18 2018] postgres cpuset=/ mems_allowed=0

Perhaps PostgreSQL needs tuning too.
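To see how often, and from which processes, the OOM killer was invoked in the attached dmesg output, the "invoked oom-killer" lines can be extracted as follows. This is a minimal sketch, assuming the lines have the same layout as the excerpt above; the function name and regular expression are illustrative.

```python
import re

# Matches lines such as:
# "[Tue Aug 21 11:40:18 2018] postgres invoked oom-killer: gfp_mask=..., nodemask=0, order=0, oom_score_adj=0"
OOM_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] (?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>\S+), nodemask=(?P<nodemask>\S+), "
    r"order=(?P<order>\d+), oom_score_adj=(?P<adj>-?\d+)$"
)


def parse_oom_invocations(lines):
    """Return one dict per 'invoked oom-killer' line found in the input."""
    events = []
    for line in lines:
        m = OOM_RE.match(line.strip())
        if m:
            events.append({
                "when": m.group("when"),
                "invoked_by": m.group("comm"),
                "order": int(m.group("order")),
                "oom_score_adj": int(m.group("adj")),
            })
    return events


# Sample lines copied from the dmesg excerpt above.
sample = [
    "[Tue Aug 21 11:40:18 2018] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0",
    "[Tue Aug 21 11:40:18 2018] postgres cpuset=/ mems_allowed=0",
]
events = parse_oom_invocations(sample)
print(events)
```

Note that the process named on this line is the one whose allocation triggered the OOM killer. It is not necessarily the process that was killed or the largest memory consumer, so the rest of the dmesg report would still need to be read to tell which one it was.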

#8 Updated by szarate over 1 year ago

  • Subject changed from Webui killed by out of memory in o3 to Webui killed by out of memory in o3 (triggered by postgresql)

#9 Updated by coolo over 1 year ago

  • Status changed from New to Rejected

It's hard to say what was going on there at that time, so I'm dropping this.

#10 Updated by coolo over 1 year ago

  • Target version changed from Current Sprint to Done
