action #41891
closedaction #39980: Cache locks assets when worker dies in critical section
Worker stalled during flock
0%
Description
I guess this is the next in the cache deadlocking itself
series
openqaworker13:/var/lib/openqa/pool/1 # systemctl status openqa-worker@2
● openqa-worker@2.service - openQA Worker #2
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-09-27 11:33:10 CEST; 4 days ago
Main PID: 11928 (worker)
Tasks: 3 (limit: 512)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@2.service
├─11928 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
├─12344 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
└─12345 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
Sep 28 00:04:57 openqaworker13 worker[11928]: GLOB(0x47cc318)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] got job 2102868: 02102868-sle-15-SP1-Installer-DVD-x86_64-Build52.3-nfv_trafficgen@ipmi-64bit-mlx_con5
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] +++ setup notes +++
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] start time: 2018-09-27 22:04:57
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] running on openqaworker13:2 (Linux 4.4.140-62-default #1 SMP Tue Jul 17 12:14:34 UTC 2018 (951943d) x86_64)
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 50592566784
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso from openqa.suse.de/tests/2102868/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso
_openqa+ 3928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 7928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 11342 0.0 0.0 8752 1448 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12339 0.0 0.0 8752 1452 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12345 0.0 0.0 8752 1480 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 16900 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19441 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19696 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24351 0.0 0.0 8752 1452 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24673 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
root 27404 0.0 0.0 9292 1628 pts/0 S+ 10:48 0:00 grep --color=auto flock
_openqa+ 29244 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] CACHE: updating the /var/lib/openqa/cache/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso last usage
Updated by szarate about 6 years ago
Calls to rsync command:
openqaworker13:/var/lib/openqa/cache # strace -s 300 -p 1381
Process 1381 attached
write(1, "\r 209 100% 0.26kB/s 0:00:00 (xfr#291, ir-chk=1027/4095)\n", 73
and it's hanging there
Updated by szarate about 6 years ago
Note: I killed the flocks before >.< but this looks like the same simptom I would say.
Updated by coolo about 6 years ago
rsync hangs like this:
Process 1381 attached
write(1, "\r 209 100% 0.26kB/"..., 73
Meaning the worker doesn't read the output of the rsync process
Updated by coolo about 6 years ago
the worker only waits for rsync to finish:
openqaworker13:/var/lib/openqa/pool # strace -p 28257
Process 28257 attached
wait4(28258,
28258 is the flock.
Updated by coolo about 6 years ago
- Priority changed from Normal to High
- Target version changed from Ready to Current Sprint
Updated by coolo about 6 years ago
It appears to me that the output of rsync fit into the buffer limits previously - and we just happen too many needles by now so it stalls. The fix for this seems to be the parent processing fetching the output. We want the output of rsync to end in worker.txt and so we stall.
Updated by EDiGiacinto about 6 years ago
- Status changed from New to In Progress
- Assignee set to EDiGiacinto
- Parent task set to #39980
Working on it, as part of migrating the cache as a separate service.
Updated by EDiGiacinto about 6 years ago
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback. Now the rsync phase is a background task of Minion
Updated by coolo almost 6 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done