Project

General

Profile

Actions

action #34750

closed

[functional][y][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call

Added by okurz about 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure
Start date:
2018-04-12
Due date:
2018-05-22
% Done:

0%

Estimated time:
Difficulty:
easy

Description

Observation

New SLE15 build was expected today but was not there.

process table on osd:

root      1710  0.0  0.0  18880  1788 ?        Ss   Apr04   0:08 /usr/sbin/cron -n
root     32679  0.0  0.0  76004  3928 ?        S    01:30   0:00  \_ /usr/sbin/CRON -n
geekote+ 32710  0.0  0.0  11872  2540 ?        Ss   01:30   0:00  |   \_ /bin/sh -c     /opt/openqa-scripts/openqa-iso-sync-sles sle15_sp0 >> /var/log/openqa_rsync.log 2>&1
geekote+ 32730  0.0  0.0  11876  2712 ?        S    01:30   0:00  |       \_ /bin/bash /opt/openqa-scripts/openqa-iso-sync-sles sle15_sp0
geekote+  4604  0.0  0.2 117804 44448 ?        S    01:32   0:01  |           \_ /usr/bin/perl -w /opt/openqa-scripts/rsync.pl --host openqa.suse.de --verbose --deprioritize-or-cancel sle15_sp0
geekote+ 14297  0.0  0.0  24636  3680 ?        S    01:39   0:00  |               \_ rsync --checksum --verbose dist.suse.de::repos/SUSE:/SLE-15:/GA:/TEST/images/iso/SLE-15-Installer-DVD-s390x-Build562.6-Media1.iso /var/lib/o

And it's 06:22 by now so the rsync-process is stuck for nearly 5 hours.

The logfile /var/log/openqa_rsync.log shows that build 562.6 was picked up for syncing but never finished. See attachment for details

$ sudo strace -f -p 14297
Process 14297 attached
select(4, [3], [], [3], {1, 697003})    = 0 (Timeout)
select(4, [3], [], [3], {60, 0})        = 0 (Timeout)
select(4, [3], [], [3], {60, 0})        = 0 (Timeout)
select(4, [3], [], [3], {60, 0})        = 0 (Timeout)
select(4, [3], [], [3], {60, 0}
$ sudo cat /proc/14297/stack
[<ffffffff8122a3a0>] poll_schedule_timeout+0x50/0x80
[<ffffffff8122adb6>] do_select+0x5b6/0x770
[<ffffffff8122b141>] core_sys_select+0x1d1/0x2f0
[<ffffffff8122b31b>] SyS_select+0xbb/0x100
[<ffffffff81640809>] entry_SYSCALL_64_fastpath+0x22/0xba
[<ffffffffffffffff>] 0xffffffffffffffff

Problem

I could not find more information. A web research only brought the insight this problem also appeared for some others.

Suggestion

rsync has a timeout option. By default it is 0 so disabled. Having a timeout sounds like a good idea.

Mitigation

I killed the rsync.pl and rsync processes and triggered manually with

/usr/bin/perl -w /opt/openqa-scripts/rsync.pl --host openqa.suse.de --verbose --add-existing --deprioritize-or-cancel sle15_sp0 | tee -a /var/log/openqa_rsync.log 2>&1

Files

openqa_rsync.log.xz (168 KB) openqa_rsync.log.xz okurz, 2018-04-12 04:26
Actions #1

Updated by okurz about 6 years ago

  • Subject changed from rsync process triggered by rsync.pl is stuck for many hours to [functional][u][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call
  • Due date set to 2018-05-22
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)
  • Priority changed from Immediate to Normal
  • Target version set to Milestone 16

[12 Apr 2018 06:41:44] build 562.6 synced+triggered

With this I completed the mitigation. Next step should be really easy and a good newcomer task: Make sure rsync is called with a sensible timeout value from within rsync.pl

Actions #2

Updated by cwh about 6 years ago

  • Difficulty set to easy
Actions #3

Updated by okurz almost 6 years ago

  • Subject changed from [functional][u][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call to [functional][y][easy]rsync process triggered by rsync.pl is stuck for many hours -> add timeout parameter to rsync call

too much tasks for the [u] team for now, maybe [y] can handle

Actions #4

Updated by mloviska almost 6 years ago

  • Assignee set to mloviska
Actions #5

Updated by mloviska almost 6 years ago

As far as I understood the code we already have one hour timeout for I/O operations set. (timeout => 3600)
According the log it looks like the problem came after the file transfer, when it was trying to work with /var/lib/openqa/.config/openqa/openqa-iso-sync.lock.

Thu Apr 12 01:40:01 CEST 2018
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
.
.
.
Thu Apr 12 06:20:01 CEST 2018
Thu Apr 12 06:20:01 CEST 2018
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
lockfile: Sorry, giving up on "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock"
can't get lock file
Traceback (most recent call last):
  File "/usr/bin/withlock", line 199, in <module>
    main()
  File "/usr/bin/withlock", line 137, in main
    lock = open(lockfile, 'w')
IOError: [Errno 13] Permission denied: '/var/lib/openqa/.config/openqa/openqa-iso-sync.lock'
/opt/openqa-scripts/openqa-iso-sync-sles: line 42:  4604 Terminated              ${0%/*}/rsync.pl --host openqa.suse.de --verbose $rsync_opts $p >> /var/log/openqa_rsync.log 2>&1
Thu Apr 12 06:25:01 CEST 2018
Thu Apr 12 06:25:01 CEST 2018
Configured to deprioritize or cancel jobs from previous builds
Actions #6

Updated by okurz almost 6 years ago

OK, so what do you suggest?

Actions #7

Updated by riafarov almost 6 years ago

Do we remove the lock when fail to sync after 1h? I guess we could at least do that, so will be able to retry.

Actions #8

Updated by mloviska almost 6 years ago

I do not really see any explicit file lock/unlock in the code. Maybe it is done by bsdtar ? But why it would lock "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock" ?

Actions #9

Updated by mloviska almost 6 years ago

What about to define and locktimeout for /var/lib/openqa/.config/openqa/openqa-iso-sync.lock in openqa-iso-sync-sles script?

if ! lockfile -r 1 -l 7200 /var/lib/openqa/.config/openqa/openqa-iso-sync.lock; then
    echo "can't get lock file"
    exit 0
fi

@okurz, @riafarov: WDYT ?

Actions #10

Updated by mloviska almost 6 years ago

  • Status changed from Workable to Feedback
Actions #11

Updated by okurz almost 6 years ago

mloviska wrote:

I do not really see any explicit file lock/unlock in the code. Maybe it is done by bsdtar ? But why it would lock "/var/lib/openqa/.config/openqa/openqa-iso-sync.lock" ?

I doubt bsdtar is mangling with the lock.

mloviska wrote:

What about to define and locktimeout for /var/lib/openqa/.config/openqa/openqa-iso-sync.lock in openqa-iso-sync-sles script?

hm, I don't know if this would help. Could introduce even more problems.

Actions #12

Updated by mloviska almost 6 years ago

bsdtar is definitely not the one to blame. But on the other hand rsync.pl has already set an timeout for IO. According to the logs it looks like the file transfer was not an issue, otherwise it would timeout after an hour. IMHO the root cause is most likely the lock file.

Actions #13

Updated by mloviska almost 6 years ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF