action #34750
closed[functional][y][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call
0%
Description
Observation¶
New SLE15 build was expected today but was not there.
process table on osd:
root 1710 0.0 0.0 18880 1788 ? Ss Apr04 0:08 /usr/sbin/cron -n
root 32679 0.0 0.0 76004 3928 ? S 01:30 0:00 \_ /usr/sbin/CRON -n
geekote+ 32710 0.0 0.0 11872 2540 ? Ss 01:30 0:00 | \_ /bin/sh -c /opt/openqa-scripts/openqa-iso-sync-sles sle15_sp0 >> /var/log/openqa_rsync.log 2>&1
geekote+ 32730 0.0 0.0 11876 2712 ? S 01:30 0:00 | \_ /bin/bash /opt/openqa-scripts/openqa-iso-sync-sles sle15_sp0
geekote+ 4604 0.0 0.2 117804 44448 ? S 01:32 0:01 | \_ /usr/bin/perl -w /opt/openqa-scripts/rsync.pl --host openqa.suse.de --verbose --deprioritize-or-cancel sle15_sp0
geekote+ 14297 0.0 0.0 24636 3680 ? S 01:39 0:00 | \_ rsync --checksum --verbose dist.suse.de::repos/SUSE:/SLE-15:/GA:/TEST/images/iso/SLE-15-Installer-DVD-s390x-Build562.6-Media1.iso /var/lib/o
And it's 06:22 by now so the rsync-process is stuck for nearly 5 hours.
The logfile /var/log/openqa_rsync.log shows that build 562.6 was picked up for syncing but never finished. See attachment for details
$ sudo strace -f -p 14297
Process 14297 attached
select(4, [3], [], [3], {1, 697003}) = 0 (Timeout)
select(4, [3], [], [3], {60, 0}) = 0 (Timeout)
select(4, [3], [], [3], {60, 0}) = 0 (Timeout)
select(4, [3], [], [3], {60, 0}) = 0 (Timeout)
select(4, [3], [], [3], {60, 0}
$ sudo cat /proc/14297/stack
[<ffffffff8122a3a0>] poll_schedule_timeout+0x50/0x80
[<ffffffff8122adb6>] do_select+0x5b6/0x770
[<ffffffff8122b141>] core_sys_select+0x1d1/0x2f0
[<ffffffff8122b31b>] SyS_select+0xbb/0x100
[<ffffffff81640809>] entry_SYSCALL_64_fastpath+0x22/0xba
[<ffffffffffffffff>] 0xffffffffffffffff
Problem¶
I could not find more information. A web research only brought the insight this problem also appeared for some others.
Suggestion¶
rsync has a timeout option. By default it is 0 so disabled. Having a timeout sounds like a good idea.
Mitigation¶
I killed the rsync.pl and rsync processes and triggered manually with
/usr/bin/perl -w /opt/openqa-scripts/rsync.pl --host openqa.suse.de --verbose --add-existing --deprioritize-or-cancel sle15_sp0 | tee -a /var/log/openqa_rsync.log 2>&1
Files
Updated by okurz over 6 years ago
- Subject changed from rsync process triggered by rsync.pl is stuck for many hours to [functional][u][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call
- Due date set to 2018-05-22
- Status changed from In Progress to Workable
- Assignee deleted (
okurz) - Priority changed from Immediate to Normal
- Target version set to Milestone 16
[12 Apr 2018 06:41:44] build 562.6 synced+triggered
With this I completed the mitigation. Next step should be really easy and a good newcomer task: Make sure rsync is called with a sensible timeout value from within rsync.pl
Updated by okurz over 6 years ago
- Subject changed from [functional][u][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call to [functional][y][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call
too much tasks for the [u] team for now, maybe [y] can handle
Updated by mloviska over 6 years ago
As far as I understood the code we already have one hour timeout for I/O operations set. (timeout => 3600)
According the log it looks like the problem came after the file transfer, when it was trying to work with /var/lib/openqa/.config/openqa/openqa-iso-sync.lock.
Thu Apr 12 01:40:01 CEST 2018
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
.
.
.
Thu Apr 12 06:20:01 CEST 2018
Thu Apr 12 06:20:01 CEST 2018
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
Traceback (most recent call last):
File "/usr/bin/withlock", line 199, in <module>
main()
File "/usr/bin/withlock", line 137, in main
lock = open(lockfile, 'w')
IOError: [Errno 13] Permission denied: '/var/lib/openqa/.config/openqa/openqa-iso-sync.lock'
/opt/openqa-scripts/openqa-iso-sync-sles: line 42: 4604 Terminated ${0%/*}/rsync.pl --host openqa.suse.de --verbose $rsync_opts $p >> /var/log/openqa_rsync.log 2>&1
Thu Apr 12 06:25:01 CEST 2018
Thu Apr 12 06:25:01 CEST 2018
Configured to deprioritize or cancel jobs from previous builds
Updated by riafarov over 6 years ago
Do we remove the lock when fail to sync after 1h? I guess we could at least do that, so will be able to retry.
Updated by mloviska over 6 years ago
I do not really see any explicit file lock/unlock in the code. Maybe it is done by bsdtar ? But why it would lock "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock" ?
Updated by mloviska over 6 years ago
Updated by mloviska over 6 years ago
- Status changed from Workable to Feedback
Updated by okurz over 6 years ago
mloviska wrote:
I do not really see any explicit file lock/unlock in the code. Maybe it is done by bsdtar ? But why it would lock "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock" ?
I doubt bsdtar is mangling with the lock.
mloviska wrote:
What about to define and locktimeout for /var/lib/openqa/.config/openqa/openqa-iso-sync.lock in openqa-iso-sync-sles script?
hm, I don't know if this would help. Could introduce even more problems.
Updated by mloviska over 6 years ago
bsdtar is definitely not the one to blame. But on the other hand rsync.pl has already set an timeout for IO. According to the logs it looks like the file transfer was not an issue, otherwise it would timeout after an hour. IMHO the root cause is most likely the lock file.