Project

General

Profile

Actions

action #41891

closed

action #39980: Cache locks assets when worker dies in critical section

Worker stalled during flock

Added by szarate over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2018-10-02
Due date:
% Done:

0%

Estimated time:

Description

I guess this is the next in the cache deadlocking itself series

openqaworker13:/var/lib/openqa/pool/1 # systemctl status openqa-worker@2
openqa-worker@2.service - openQA Worker #2
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-09-27 11:33:10 CEST; 4 days ago
Main PID: 11928 (worker)
Tasks: 3 (limit: 512)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@2.service
├─11928 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
├─12344 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
└─12345 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/

Sep 28 00:04:57 openqaworker13 worker[11928]: GLOB(0x47cc318)[error] Could not unlink 'serial0.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'autoinst-log.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [error] Could not unlink 'serial_terminal.txt': No such file or directory
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] got job 2102868: 02102868-sle-15-SP1-Installer-DVD-x86_64-Build52.3-nfv_trafficgen@ipmi-64bit-mlx_con5
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] +++ setup notes +++
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] start time: 2018-09-27 22:04:57
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] running on openqaworker13:2 (Linux 4.4.140-62-default #1 SMP Tue Jul 17 12:14:34 UTC 2018 (951943d) x86_64)
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] OpenQA::Worker::Cache: Initialized with openqa.suse.de at /var/lib/openqa/cache, current size is 50592566784
Sep 28 00:04:57 openqaworker13 worker[11928]: [info] Downloading SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso from openqa.suse.de/tests/2102868/asset/iso/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso

_openqa+ 3928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 7928 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 11342 0.0 0.0 8752 1448 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12339 0.0 0.0 8752 1452 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 12345 0.0 0.0 8752 1480 ? S Sep28 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 16900 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19441 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 19696 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24351 0.0 0.0 8752 1452 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
_openqa+ 24673 0.0 0.0 8752 1480 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/
root 27404 0.0 0.0 9292 1628 pts/0 S+ 10:48 0:00 grep --color=auto flock
_openqa+ 29244 0.0 0.0 8752 1476 ? S Sep27 0:00 flock /var/lib/openqa/cache/needleslock rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/tests/

Sep 28 00:04:57 openqaworker13 worker[11928]: [info] CACHE: updating the /var/lib/openqa/cache/SLE-15-SP1-Installer-DVD-x86_64-Build52.3-Media1.iso last usage

Actions #1

Updated by coolo over 5 years ago

So what held the lock?

Actions #2

Updated by szarate over 5 years ago

Calls to rsync command:

openqaworker13:/var/lib/openqa/cache # strace -s 300 -p 1381

Process 1381 attached
write(1, "\r            209 100%    0.26kB/s    0:00:00 (xfr#291, ir-chk=1027/4095)\n", 73

and it's hanging there

Actions #3

Updated by szarate over 5 years ago

Note: I killed the flocks before >.< but this looks like the same simptom I would say.

Actions #4

Updated by coolo over 5 years ago

rsync hangs like this:

Process 1381 attached
write(1, "\r            209 100%    0.26kB/"..., 73

Meaning the worker doesn't read the output of the rsync process

Actions #5

Updated by coolo over 5 years ago

the worker only waits for rsync to finish:

openqaworker13:/var/lib/openqa/pool # strace -p 28257 
Process 28257 attached
wait4(28258, 

28258 is the flock.

Actions #6

Updated by coolo over 5 years ago

  • Priority changed from Normal to High
  • Target version changed from Ready to Current Sprint
Actions #7

Updated by coolo over 5 years ago

It appears to me that the output of rsync fit into the buffer limits previously - and we just happen too many needles by now so it stalls. The fix for this seems to be the parent processing fetching the output. We want the output of rsync to end in worker.txt and so we stall.

Actions #8

Updated by EDiGiacinto over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to EDiGiacinto
  • Parent task set to #39980

Working on it, as part of migrating the cache as a separate service.

Actions #9

Updated by EDiGiacinto over 5 years ago

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/1783 has been merged, setting to Feedback. Now the rsync phase is a background task of Minion

Actions #10

Updated by coolo over 5 years ago

  • Status changed from Feedback to Resolved
  • Target version changed from Current Sprint to Done
Actions

Also available in: Atom PDF