action #64129

open

Set `$0` for upload process to something more explicit (was: Duplicate worker instances competing)

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version: QA (public, currently private due to #173521) - future
Start date: 2020-03-03
Due date:
% Done: 0%
Estimated time:
Description

Observation

Since today we seem to have multiple worker instances with the same instance number, which is causing havoc.

From o3:

$ for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "ps aux|grep 'script/worker'|sort -k14"; done                 
aarch64
…
openqaworker1
…
_openqa+  2833  0.2  0.2 673740 591272 ?       Ss   03:34   1:41 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+  2832  0.1  0.1 410048 327280 ?       Ss   03:34   1:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
_openqa+  2842  0.1  0.1 360052 277376 ?       Ss   03:34   0:58 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+ 24314 17.4  0.1 423588 336580 ?       S    15:40   1:24 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  2843  0.1  0.1 411296 328708 ?       Ss   03:34   0:59 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  2835  0.1  0.1 377136 294516 ?       Ss   03:34   1:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+  2841  0.1  0.1 491764 409008 ?       Ss   03:34   1:13 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+  2827  0.1  0.1 429852 347348 ?       Ss   03:34   1:04 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+  2836  0.1  0.1 361520 279244 ?       Ss   03:34   1:00 /usr/bin/perl /usr/share/openqa/script/worker --instance 9
openqaworker4
…
openqaworker7
_openqa+  3292  0.0  0.1 371188 289416 ?       Ss   03:37   0:42 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+ 18466  1.5  0.1 538604 451672 ?       S    15:46   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+  3288  0.1  0.1 538604 456548 ?       Ss   03:37   0:58 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+  3281  0.0  0.1 406300 324768 ?       Ss   03:37   0:37 /usr/bin/perl /usr/share/openqa/script/worker --instance 11
_openqa+  3289  0.0  0.1 386092 304688 ?       Ss   03:37   0:40 /usr/bin/perl /usr/share/openqa/script/worker --instance 12

The situation is similar on osd. Reported by dzedro in #63853.
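To avoid spotting repeated instance numbers by eye, the listing above can be filtered with a small helper. This is only a sketch: `repeated_instances` is a hypothetical name, and it assumes `ps aux`-style input where the PID is the second column and the command line contains `--instance N`. A repeated number on its own only says that two processes share a command line, not how they are related.

```shell
# Hypothetical helper: reads `ps aux`-style lines on stdin and prints every
# worker instance number that occurs more than once, with the PIDs involved.
repeated_instances() {
    awk '
        /script\/worker/ {
            # Find the "--instance" argument and remember the PID (column 2).
            for (i = 1; i < NF; i++) {
                if ($i == "--instance") {
                    n = $(i + 1)
                    count[n]++
                    pids[n] = pids[n] " " $2
                }
            }
        }
        END {
            for (n in count)
                if (count[n] > 1)
                    print "instance " n ": PIDs" pids[n]
        }
    ' | sort
}
```

On a worker host this could be used as `ps aux | repeated_instances`.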


Files

journal_poo63853.xz (15.5 KB) okurz, 2020-03-03 14:52

Related issues 1 (0 open, 1 closed)

Related to openQA Tests (public) - action #63853: [tools] broken /etc/sysconfig/network/ifcfg-br1 (Resolved, okurz, 2020-02-26)

Actions #1

Updated by okurz almost 5 years ago

On osd, running `salt -l error --no-color -C 'G@roles:worker' --state-output=changes cmd.run "ps aux|grep script/worker|sort -k 14"` reveals more cases, e.g. on openqaworker8.

At least `systemctl status openqa-worker@1` shows that the duplicate "instance 1" is tracked within the same systemd service:

# systemctl status openqa-worker@1
● openqa-worker@1.service - openQA Worker #1
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: active (running) since Tue 2020-03-03 14:17:58 CET; 1h 22min ago
  Process: 48262 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
 Main PID: 48263 (worker)
    Tasks: 2
   CGroup: /openqa.slice/openqa-worker.slice/openqa-worker@1.service
           ├─48263 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
           └─55122 /usr/bin/perl /usr/share/openqa/script/worker --instance 1

Mar 03 15:40:45 openqaworker8 worker[48263]: [info] [pid:55122] SLES-12-SP4-x86_64-mru-install-minimal-with-addons-Build:13917:adcli-Server-DVD-Incidents-64bit.qcow2: Processing chunk 291>
…

Something started the new process, probably a salt deployment, but the old process does not seem to have been stopped properly.

I attached a journal covering the restart of the openQA worker instance service and the PIDs in question.

Actions #2

Updated by okurz almost 5 years ago

  • Related to action #63853: [tools] broken /etc/sysconfig/network/ifcfg-br1 added
Actions #3

Updated by okurz almost 5 years ago

Btw, we use subprocesses, e.g. for uploading. These are not problematic. Still, to me the log in https://progress.opensuse.org/attachments/download/9559/journal_poo63853.xz looks like something is mangled.

Actions #4

Updated by mkittler almost 5 years ago

Which place in the log do you mean exactly?

Actions #5

Updated by okurz almost 5 years ago

  • Subject changed from Duplicate worker instances competing to Set `$0` for upload process to something more explicit (was: Duplicate worker instances competing)
  • Category changed from Regressions/Crashes to Feature requests
  • Priority changed from Urgent to Low

mkittler, sebastianriedel and I looked into the log files and found nothing that could explain network problems in multi-machine tests. Everything seems to be in order. Multiple worker instances would try to acquire the same lock before doing anything with qemu, which only then tries to acquire the network. What is visible in the process tree are the upload processes.

  • Idea for improvement: Set $0 for upload process to something more explicit

The original output from dzedro confused me because he didn't use the "-f" parameter for ps, so it looked like these "duplicate processes" were running in parallel when in reality one is a subprocess of the other, just with the same name. This also looks a bit weird in systemctl status, which can show two processes with exactly the same command line next to each other on the same level.
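Until the upload process gets a more explicit name, the parent/child relation can be checked from the PPID column. A rough sketch, assuming input in the format of `ps -eo pid,ppid,args`; `worker_children` is a hypothetical helper name, not something shipped with openQA:

```shell
# Hypothetical helper: reads "PID PPID ARGS..." lines on stdin and reports
# worker processes whose parent is a worker with the same instance number,
# i.e. subprocesses rather than competing duplicates.
worker_children() {
    awk '
        /script\/worker/ {
            for (i = 3; i < NF; i++) {
                if ($i == "--instance") {
                    inst[$1] = $(i + 1)    # instance number by PID
                    parent[$1] = $2        # parent PID by PID
                    order[++n] = $1        # keep input order for output
                }
            }
        }
        END {
            for (k = 1; k <= n; k++) {
                pid = order[k]
                if ((parent[pid] in inst) && inst[parent[pid]] == inst[pid])
                    print pid " is a subprocess of " parent[pid] " (instance " inst[pid] ")"
            }
        }
    '
}
```

For example, `ps -eo pid,ppid,args | worker_children` prints one line per worker subprocess. Two processes with the same instance number that are not reported here would be the ones worth a closer look.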

Actions #6

Updated by okurz over 4 years ago

  • Target version set to future