action #37782
closed
[kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes
Added by nicksinger over 6 years ago.
Updated about 6 years ago.
Category:
Bugs in existing tests
Description
Observation
openQA test in scenario sle-12-SP4-Server-DVD-s390x-fs_stress@s390x-kvm-sle12 fails in
execute_test_run due to a broken pipe.
Suggestions to improve this test
To me this issue looks like some timeout after copying many files around. IMHO this can always happen if we rely on long-lived TCP sessions.
Without looking into test_fs_stress-run, I'd assume it uses SSH. If so, one could try to increase the ssh timeout value (https://askubuntu.com/questions/127369/how-to-prevent-write-failed-broken-pipe-on-ssh-connection)
Another idea would be to implement retries (e.g. only fail after 3 retries).
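A minimal sketch of that retry idea, assuming the stress run is started through the testapi's script_run; the command path, the timeout and the 3-attempt limit below are placeholders, not the actual test code:

my $attempts = 3;    # assumed retry limit
for my $try (1 .. $attempts) {
    # script_run returns the command's exit code (0 on success, undef on timeout)
    my $ret = script_run("/usr/share/qa/tools/test_fs_stress-run", 7200);
    last if defined($ret) && $ret == 0;
    die "fs_stress still failing after $attempts attempts" if $try == $attempts;
    record_info("retry", "fs_stress attempt $try failed, retrying");
}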
Reproducible
So far this has failed only in build 0263 and appears to be sporadic.
- Subject changed from [functional][s390x][medium] test fails in execute_test_run because it cannot handle broken pipes to [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes
- Related to action #34012: [kernel] too generic test failure in "execute_test_run" for stress tests, was previously something more specific like "acceptance_fs_stress" added
- Subject changed from [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes to [kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes
- Assignee set to yosun
Hi @yosun, as discussed in #34012 I assume you want to pick it up?
- Assignee changed from yosun to okurz
Thanks for the info.
This failed with "packet_write_wait: Connection to 10.161.145.16 port 22: Broken pipe", and the serial log contains no oops or crash information. It fails while running "/usr/share/qa/tools/file_copy -j 4 -i 5 -s 5000", which means 4 parallel jobs, 5 iterations, copying 5000 MB (5 GB) of files each time. Under such high filesystem stress, and with this test KVM guest having only 912 MB of RAM, it is easy to trigger the OOM killer, which then kills processes at random, so the ssh service or a related process can be killed.
The log only shows that the ssh connection was lost; this is neither a kernel nor a test issue but a random one. The solution is to give this poor s390x KVM guest more RAM or disk space to avoid or at least reduce this kind of issue. We have reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.
All in all, I suggest the tools team solve this by giving this KVM guest more resources, or that we simply reject this kind of ticket.
- Status changed from New to Feedback
yosun wrote:
We have reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.
Which "Lab's developer" are you referring to? Do you have a ticket for that?
All in all, I suggest the tools team solve this by giving this KVM guest more resources, or that we simply reject this kind of ticket.
I don't think this is a matter for the tools team, because for KVM we just configure the machine accordingly. We could do that, but I want to wait for your response first.
- Target version set to future
I was wrong, it does not fail because of an OOM issue. I checked the test code and the log in https://openqa.suse.de/tests/1777070/file/autoinst-log.txt again and found that this test randomly fails because the following line times out after 90 seconds:
assert_script_run("tar cjf $tarball -C /var/log/qa/ctcs2 `ls /var/log/qa/ctcs2/`");
It only fails after the fs_stress test, and this line runs right after the test finishes. I guess that after the fs stress the system needs more time to free enough space to create the log tarball. I think the solution is to add the following lines before the failing part:
if (get_var("QA_TESTSUITE") eq "fs_stress") {
    # give the system time to settle after the heavy I/O load
    sleep 120;
}
Rather than such a long fixed sleep I guess we should wait for what we really need, e.g. check the free space and wait until enough space is available again. Or just save the tarball to a different location, e.g. the ram disk /dev/shm.
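A rough sketch of that free-space polling, assuming the testapi's script_output is available; the 200 MB threshold and the 12 x 10 s budget are arbitrary assumptions:

# Hypothetical alternative to a fixed "sleep 120": wait until the log
# filesystem has some headroom again before creating the tarball.
my $needed_kb = 200 * 1024;    # assumed minimum free space in KiB
for my $try (1 .. 12) {
    # df -Pk prints the available space in KiB in column 4
    my $free_kb = script_output("df -Pk /var/log/qa/ctcs2 | awk 'NR==2 {print \$4}'");
    last if $free_kb >= $needed_kb;
    sleep 10;
}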
- Assignee changed from okurz to yosun
@yosun, is there anything else I could help with here?
The hints in #8 are helpful, thanks! But I still haven't found time to work on it. Maybe I can work on it during the SLE12SP4 RC period.
- Status changed from Feedback to Resolved
Fixed by adding a sync before tarring the logs. I ran some tests and could not reproduce the issue. Feel free to reopen if it happens again.
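For reference, a minimal sketch of such a fix, assuming the tar call quoted above; the 300 s timeout for the sync is an assumption, and the actual change in the test code may differ:

# Flush dirty pages to disk first so the tar call right after the fs
# stress run no longer runs into the 90 s default timeout.
assert_script_run("sync", 300);
assert_script_run("tar cjf $tarball -C /var/log/qa/ctcs2 `ls /var/log/qa/ctcs2/`");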