action #64129 (open)
Set `$0` for upload process to something more explicit (was: Duplicate worker instances competing)
Description
Observation
Seems like since today we have multiple worker instances with the same instance number causing havoc.
From o3:
$ for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "ps aux|grep 'script/worker'|sort -k14"; done
aarch64
…
openqaworker1
…
_openqa+ 2833 0.2 0.2 673740 591272 ? Ss 03:34 1:41 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+ 2832 0.1 0.1 410048 327280 ? Ss 03:34 1:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
_openqa+ 2842 0.1 0.1 360052 277376 ? Ss 03:34 0:58 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+ 24314 17.4 0.1 423588 336580 ? S 15:40 1:24 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+ 2843 0.1 0.1 411296 328708 ? Ss 03:34 0:59 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+ 2835 0.1 0.1 377136 294516 ? Ss 03:34 1:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+ 2841 0.1 0.1 491764 409008 ? Ss 03:34 1:13 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+ 2827 0.1 0.1 429852 347348 ? Ss 03:34 1:04 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+ 2836 0.1 0.1 361520 279244 ? Ss 03:34 1:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 9
openqaworker4
…
openqaworker7
_openqa+ 3292 0.0 0.1 371188 289416 ? Ss 03:37 0:42 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+ 18466 1.5 0.1 538604 451672 ? S 15:46 0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+ 3288 0.1 0.1 538604 456548 ? Ss 03:37 0:58 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+ 3281 0.0 0.1 406300 324768 ? Ss 03:37 0:37 /usr/bin/perl /usr/share/openqa/script/worker --instance 11
_openqa+ 3289 0.0 0.1 386092 304688 ? Ss 03:37 0:40 /usr/bin/perl /usr/share/openqa/script/worker --instance 12
The situation is similar on osd. Reported by dzedro in #63853.
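To spot such candidates without reading the sorted listing by eye, the `ps aux` output can be piped through a small filter. The following is only a sketch: the function name `duplicate_instances` is made up for illustration, and a hit only means that two processes share an instance number, not that they are independent workers.

```shell
# Hypothetical helper (not part of openQA): print every instance number that
# occurs more than once in "ps aux"-style lines read from stdin, together
# with the number of processes using it.
duplicate_instances() {
    awk '
    {
        # The instance number is the field right after "--instance".
        for (i = 1; i < NF; i++)
            if ($i == "--instance") count[$(i + 1)]++
    }
    END {
        for (n in count)
            if (count[n] > 1)
                printf "instance %s: %d processes\n", n, count[n]
    }' | sort
}

# Possible usage on a worker host:
#   ps aux | grep 'script/worker' | duplicate_instances
```

Applied to the openqaworker7 lines above, this would report instance 10 with two processes.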
Updated by okurz almost 5 years ago
- File journal_poo63853.xz added
Running
salt -l error --no-color -C 'G@roles:worker' --state-output=changes cmd.run "ps aux|grep script/worker|sort -k 14"
on osd reveals more, e.g. on openqaworker8.
systemctl status openqa-worker@1
reveals that the duplicate "instance 1" is at least tracked within the same systemd service:
# systemctl status openqa-worker@1
● openqa-worker@1.service - openQA Worker #1
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: active (running) since Tue 2020-03-03 14:17:58 CET; 1h 22min ago
Process: 48262 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
Main PID: 48263 (worker)
Tasks: 2
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@1.service
├─48263 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
└─55122 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
Mar 03 15:40:45 openqaworker8 worker[48263]: [info] [pid:55122] SLES-12-SP4-x86_64-mru-install-minimal-with-addons-Build:13917:adcli-Server-DVD-Incidents-64bit.qcow2: Processing chunk 291>
…
Something started the new process, probably a salt deployment, but the old process does not seem to have been stopped properly.
A journal covering the restart of the openQA worker instance service and the PIDs in question is attached.
Updated by okurz almost 5 years ago
- Related to action #63853: [tools] broken /etc/sysconfig/network/ifcfg-br1 added
Updated by okurz almost 5 years ago
Btw, we use subprocesses, e.g. for uploading. These are not problematic. Still, to me the log in https://progress.opensuse.org/attachments/download/9559/journal_poo63853.xz looks like something is mangled.
Updated by mkittler almost 5 years ago
Which place in the log do you mean exactly?
Updated by okurz almost 5 years ago
- Subject changed from Duplicate worker instances competing to Set `$0` for upload process to something more explicit (was: Duplicate worker instances competing)
- Category changed from Regressions/Crashes to Feature requests
- Priority changed from Urgent to Low
mkittler, sebastianriedel and I looked into the log files and found nothing that could explain network problems in multi-machine tests. Everything seems to be in order. Genuinely duplicate worker instances would try to acquire the same lock before doing anything with qemu, which is what then tries to acquire the network. What is visible in the process tree are the upload processes.
- Idea for improvement: Set `$0` for upload process to something more explicit
The original output from dzedro confused me because he didn't use the "-f" parameter for ps, so it looked like these "duplicate processes" were running in parallel when in reality one is a subprocess of the other, just with the same name. This also looks a bit weird in systemctl status, as it can also show two processes with exactly the same command line next to each other on the same level.
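Until the upload process gets a more explicit name, the parent/child relationship can be made visible by comparing PIDs and PPIDs. The sketch below assumes input of the form "PID PPID COMMAND..."; the function name `classify_workers` and the output labels are invented for illustration, and the PPIDs in the usage example are assumed values, not taken from the journal.

```shell
# Hypothetical helper (not part of openQA): classify processes read from
# stdin as "top-level" or as a child of another listed process.
# Expected input: one process per line, "PID PPID COMMAND...", e.g. from
#   ps -e -o pid= -o ppid= -o args= | grep 'script/worker'
# A process counts as a child if its PPID is the PID of another listed
# process with exactly the same command line (the upload subprocess case).
classify_workers() {
    awk '
    {
        pid = $1; ppid = $2
        # Rebuild the command line from the remaining fields.
        cmd = $3
        for (i = 4; i <= NF; i++) cmd = cmd " " $i
        pids[NR] = pid; parent[pid] = ppid; cmdline[pid] = cmd
    }
    END {
        for (n = 1; n <= NR; n++) {
            pid = pids[n]; ppid = parent[pid]
            if ((ppid in cmdline) && cmdline[ppid] == cmdline[pid])
                printf "%s child-of-%s %s\n", pid, ppid, cmdline[pid]
            else
                printf "%s top-level %s\n", pid, cmdline[pid]
        }
    }'
}
```

Fed with two lines for PIDs 48263 (assumed PPID 1) and 55122 (assumed PPID 48263), both running `worker --instance 1`, it would label the first as top-level and the second as child-of-48263, which is the distinction that the plain listing hides.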