action #41891

action #39980: Cache locks assets when worker dies in critical section

Worker stalled during flock

Added by szarate over 1 year ago. Updated about 1 year ago.

Status:ResolvedStart date:02/10/2018
Priority:HighDue date:
Assignee:EDiGiacinto% Done:

0%

Category:Feature requests
Target version:Done
Difficulty:
Duration:

Description

I guess this is the next in the cache deadlocking itself series

openqaworker13:/var/lib/openqa/pool/1 # systemctl status openqa-worker@2
openqa-worker@2.service - openQA Worker #2
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-09-27 11:33:10 CEST; 4 days ago
Main PID: 11928 (worker)
Tasks: 3 (limit: 512)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@2.service
├─11928 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
├─12344 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
└─12345 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/

Sep 28 00:04:57 openqaworker13 worker[11928]: GLOB(0x47cc318)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] got job 2102868: 02102868-sle-15-SP1-Installer-DVD-x86_64-Build52.3-nfv_trafficgen@ipmi-64bit-mlx_con5
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] +++ setup notes +++
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] start time: 2018-09-27 22:04:57
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] running on openqaworker13:2 (Linux 4.4.140-62-default #1 SMP Tue Jul 17 12:14:34 UTC 2018 (951943d) x86_64)
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 50592566784
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso from openqa.suse.de/tests/2102868/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso

_openqa+ 3928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 7928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 11342 0.0 0.0 8752 1448 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12339 0.0 0.0 8752 1452 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12345 0.0 0.0 8752 1480 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 16900 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19441 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19696 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24351 0.0 0.0 8752 1452 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24673 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
root 27404 0.0 0.0 9292 1628 pts/0 S+ 10:48 0:00 grep --color=auto flock
_openqa+ 29244 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/

Sep 28 00:04:57 openqaworker13 worker[11928]: [info] CACHE: updating the /var/lib/openqa/cache/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso last usage

History

#1 Updated by coolo over 1 year ago

So what held the lock?

#2 Updated by szarate over 1 year ago

Calls to rsync command:

openqaworker13:/var/lib/openqa/cache # strace -s 300 -p 1381

Process 1381 attached
write(1, "\r            209 100%    0.26kB/s    0:00:00 (xfr#291, ir-chk=1027/4095)\n", 73

and it's hanging there

#3 Updated by szarate over 1 year ago

Note: I killed the flocks before >.< but this looks like the same simptom I would say.

#4 Updated by coolo over 1 year ago

rsync hangs like this:

Process 1381 attached
write(1, "\r            209 100%    0.26kB/"..., 73

Meaning the worker doesn't read the output of the rsync process

#5 Updated by coolo over 1 year ago

the worker only waits for rsync to finish:

openqaworker13:/var/lib/openqa/pool # strace -p 28257
Process 28257 attached
wait4(28258,

28258 is the flock.

#6 Updated by coolo over 1 year ago

  • Priority changed from Normal to High
  • Target version changed from Ready to Current Sprint

#7 Updated by coolo over 1 year ago

It appears to me that the output of rsync fit into the buffer limits previously - and we just happen too many needles by now so it stalls. The fix for this seems to be the parent processing fetching the output. We want the output of rsync to end in worker.txt and so we stall.

#8 Updated by EDiGiacinto over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto
  • Parent task set to #39980

Working on it, as part of migrating the cache as a separate service.

#9 Updated by EDiGiacinto over 1 year ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback. Now the rsync phase is a background task of Minion

#10 Updated by coolo about 1 year ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done

Also available in: Atom PDF