action #10050

[OOM] isotovideo killed openqa-websockets due to OOM

Added by mlin7442 almost 7 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Concrete Bugs
Target version:
-
Start date:
2015-12-29
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

isotovideo killed openqa-websockets due to OOM; this happened on 2015-12-28. See the attached oom_20151228.log file for more details.

I know a similar issue has happened before, so this is just for the record.

oom_20151228.log (143 KB) oom_20151228.log mlin7442, 2015-12-29 05:23

History

#1 Updated by coolo almost 7 years ago

  • Assignee set to oholecek

The websocket code seems to have a heavy memory leak. The worker really shouldn't require 2.5G of RAM:


5524 _openqa+ 20 0 2680852 2,492g 2432 S 0,000 1,976 63:57.09 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
25030 _openqa+ 20 0 2507720 2,328g 2844 S 0,000 1,846 36:52.36 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
25083 _openqa+ 20 0 2389344 2,216g 2932 S 0,330 1,757 36:57.41 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
24996 _openqa+ 20 0 1622128 1,484g 2968 S 0,000 1,177 32:39.39 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
25156 _openqa+ 20 0 1536432 1,402g 2908 S 0,000 1,112 29:21.12 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
27154 _openqa+ 20 0 1652892 1,244g 20024 S 5,611 0,986 12:43.70 /usr/bin/perl -w /usr/bin/isotovideo -d
27765 _openqa+ 20 0 1586892 1,147g 20016 S 16,172 0,909 11:15.36 /usr/bin/perl -w /usr/bin/isotovideo -d
27173 _openqa+ 20 0 5953524 1,145g 18400 S 0,330 0,908 27:52.18 /usr/bin/qemu-system-x86_64 -machine accel=kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga cirrus -m 102+
27772 _openqa+ 20 0 5964276 1,144g 18028 S 57,426 0,907 30:19.19 /usr/bin/qemu-system-x86_64 -machine accel=kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga cirrus -m 102+
32284 _openqa+ 20 0 3644116 1,055g 18336 S 5,941 0,836 10:23.33 /usr/bin/qemu-system-x86_64 -machine accel=kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga cirrus -m 102+
12616 geekote+ 20 0 1025964 897872 4192 S 3,300 0,679 205:20.14 /usr/bin/perl /usr/share/openqa/script/openqa-websockets
24918 _openqa+ 20 0 954648 888456 2884 S 0,000 0,672 28:06.64 /usr/bin/perl /usr/share/openqa/script/worker --instance 2

#2 Updated by coolo almost 7 years ago

This is how it looks after a fresh restart:


2586 _openqa+ 20 0 101992 40160 6696 S 1,852 0,030 0:00.50 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
2598 _openqa+ 20 0 101860 40064 6740 S 0,000 0,030 0:00.50 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
2600 _openqa+ 20 0 101860 40060 6740 S 0,000 0,030 0:00.49 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
2606 _openqa+ 20 0 101860 40056 6748 S 0,000 0,030 0:00.51 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
2596 _openqa+ 20 0 101860 40036 6720 S 0,000 0,030 0:00.53 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
2594 _openqa+ 20 0 101860 40028 6712 S 0,000 0,030 0:00.48 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
2604 _openqa+ 20 0 101860 40020 6712 S 0,000 0,030 0:00.51 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
2602 _openqa+ 20 0 101844 39992 6688 S 0,000 0,030 0:00.52 /usr/bin/perl /usr/share/openqa/script/worker --instance 2

#3 Updated by oholecek almost 7 years ago

To clarify, these are workers with WORKER_USE_WEBSOCKETS enabled, right? Otherwise, what is the connection between the workers and the websocket server?

#4 Updated by coolo almost 7 years ago

I assumed that we do job_grab through WS now

#5 Updated by coolo almost 7 years ago

I have no idea how to tackle this. The memory leak can be anywhere in the stack - a look at the websocket service didn't reveal anything obvious. So a cron job to restart the services every midnight is the only option I see ;(
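The midnight restart mentioned above could be sketched as a cron entry along these lines; the file path and the unit names openqa-websockets and openqa-worker@N are assumptions based on a typical systemd-based openQA host, not taken from this ticket:

```shell
# /etc/cron.d/openqa-restart -- hypothetical workaround sketch, not an
# official openQA file. Restarts the websocket server and the worker
# instances every midnight to reclaim the leaked memory.
0 0 * * *  root  systemctl restart openqa-websockets 'openqa-worker@*'
```

Whether this is acceptable depends on the deployment: restarting the workers mid-run would kill any jobs in progress, so a real version would probably need to wait for idle workers or be scheduled in a known quiet window.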

#6 Updated by okurz almost 7 years ago

I would suggest going with the "workaround" of a daily restart, with a proper reference to this issue, but keeping this issue open at lower priority for the future and trying to narrow down where the memory leak might be with better tracing tools, e.g. by first coming up with an automatic test that confirms the memory leak.
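A first step towards confirming the leak automatically could be to log how a process's resident set size grows over time; a minimal sketch (the rss_watch name and output format are made up here, not part of openQA):

```shell
#!/bin/sh
# rss_watch PID [INTERVAL]: print "unix-timestamp rss-in-KiB" every
# INTERVAL seconds (default 60) until the process exits. A steadily
# growing second column over hours confirms a leak.
rss_watch() {
    pid=$1
    interval=${2:-60}
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(ps -o rss= -p "$pid")
        printf '%s %s\n' "$(date +%s)" "$rss"
        sleep "$interval"
    done
}
```

For example, `rss_watch 12616 300 >> websockets-rss.log` against the openqa-websockets pid from the listing in comment #1 would give a time series that a follow-up script (or a plot) can check for monotonic growth.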

#7 Updated by coolo almost 6 years ago

  • Status changed from New to Closed

I checked one worker that has been running for 6 days and it's at 180M, which is peanuts.
