action #68164

closed

https://github.com/os-autoinst/openQA/pull/3177 caused a regression: the cacheservice minion systemd service could not start

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
-
Start date:
2020-06-17
Due date:
% Done:

0%

Estimated time:

Description

Observation

Reported in the IRC room [#opensuse-factory](irc://chat.freenode.net/opensuse-factory):

[17/06/2020 09:02:45] <Dimstar> Good morning all; anybody knows what's up with o3 not picking up the scheduled jobs?
[17/06/2020 09:04:44] <guillaume_g> Dimstar: Hi! :) Workers are reported broken
[17/06/2020 09:05:00] <guillaume_g> Dimstar: "No workers active in the cache service"
[17/06/2020 09:05:30] <guillaume_g> the only workers which are running are the one which are not auto-updated ;)
[17/06/2020 09:07:31] <guillaume_g> kraih: ^ could it be your MR https://github.com/os-autoinst/openQA/pull/3177 ?
[17/06/2020 09:07:32] <|Anna|> Github project os-autoinst/openQA pull request#3177: "Reset locks when restarting the cache service Minion worker", created on 2020-06-16, status: closed on 2020-06-16, https://github.com/os-autoinst/openQA/pull/3177
[17/06/2020 09:08:09] <fvogt> At least on openqa-aarch64 all services are up and running, according to systemctl
[17/06/2020 09:09:40] <guillaume_g> Dimstar: Could you abort openSUSE:Factory:ARM:Live/JeOS:GNOME-efi.aarch64 please?
[17/06/2020 09:09:50] <Dimstar> fun - worker info for e.g. ow1:1 is alive, last seen less than a minute ago, broken
[17/06/2020 09:10:06] <Dimstar> guillaume_g: done
[17/06/2020 09:12:02] <fvogt> openqa-worker-cacheservice-minion.service is dead - it printed usage info...
[17/06/2020 09:12:09] <fvogt> " See 'APPLICATION help COMMAND' for more information on a specific command."
[17/06/2020 09:12:24] <fvogt> For some reason that has exit code 0, which isn't helpful
[17/06/2020 09:13:49] <guillaume_g> Dimstar: thanks! :)
[17/06/2020 09:14:07] <fvogt> It's the order of arguments
[17/06/2020 09:14:15] <fvogt> It has to be "run -m production", not "-m production run"
[17/06/2020 09:16:41] <fvogt> Started it manually, worker is back. So confirmed to be that indeed
[17/06/2020 09:23:58] <guillaume_g> Great!
[17/06/2020 09:24:22] <fvogt> Now we just need someone to commit and push the fix
[17/06/2020 09:24:53] <Dimstar> fvogt: did you restart all workers for this? e.g. ow1, ow4 ow7, imagetester?
[17/06/2020 09:25:19] <fvogt> Dimstar: Where happened to your 'S'?
[17/06/2020 09:25:30] <fvogt> No, I only tried to prove the theory on openqa-aarch64
[17/06/2020 09:25:58] <fvogt> You can run su _openqa-worker -c '/usr/share/openqa/script/openqa-workercache run -m production --reset-locks' if you want to
[17/06/2020 09:26:17] <Dimstar> fvogt: ok; that's fine; just needed to know... I'll kick the x86_64 workers

From openqaworker13 within the osd infrastructure:

Jun 17 09:48:57 openqaworker13 systemd[1]: Started OpenQA Worker Cache Service Minion.
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: [22459] [i] [0oHtg3mJ] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: Usage: APPLICATION COMMAND [OPTIONS]
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   mojo version
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   mojo generate lite-app
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   ./myapp.pl daemon -m production -l http://*:8080
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   ./myapp.pl get /foo
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   ./myapp.pl routes -v
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: Tip: CGI and PSGI environments can be automatically detected very often and
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:      work without commands.
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: Options (for all commands):
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   -h, --help          Get more information on a specific command
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:       --home <path>   Path to home directory of your application, defaults to
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:                       the value of MOJO_HOME or auto-detection
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:   -m, --mode <name>   Operating mode for your application, defaults to the
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:                       value of MOJO_MODE/PLACK_ENV or "development"
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: Commands:
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  cgi       Start application with CGI
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  cpanify   Upload distribution to CPAN
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  daemon    Start application with HTTP and WebSocket server
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  eval      Run code against application
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  generate  Generate files and directories from templates
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  get       Perform HTTP request
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  inflate   Inflate embedded files to real files
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  minion    Minion job queue
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  prefork   Start application with pre-forking HTTP and WebSocket server
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  psgi      Start application with PSGI
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  routes    Show available routes
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  run       Start Minion worker
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]:  version   Show versions of available modules
Jun 17 09:48:58 openqaworker13 openqa-worker-cacheservice-minion[22459]: See 'APPLICATION help COMMAND' for more information on a specific command.

So the service prints the usage help and exits with success (exit code 0) instead of starting the Minion worker.
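To illustrate why the argument order matters, here is a toy sketch. `fake_app` is a hypothetical stand-in, not the real openqa-workercache script or Mojolicious code. It assumes only what the log above suggests: the first argument is taken as the command name, and anything else leads to the usage text with exit code 0.

```shell
#!/bin/sh
# Toy stand-in for a command dispatcher (hypothetical, for illustration only).
# Assumption: the first argument is interpreted as the command name.
fake_app() {
    case "${1:-}" in
        run)
            shift
            echo "starting worker with: $*"
            return 0
            ;;
        *)
            # Unknown "command" (here: "-m"): print help, but still succeed.
            echo "Usage: APPLICATION COMMAND [OPTIONS]"
            return 0
            ;;
    esac
}

# Broken order from the regression: options before the command.
fake_app -m production run
echo "exit code: $?"

# Order that worked when started manually: command first, then options.
fake_app run -m production
echo "exit code: $?"
```

Both calls end with exit code 0 in this sketch, so a supervisor looking only at the exit status could not tell the two cases apart.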

Lessons learned + TODOs

  • Ask explicitly how changes to systemd files have been tested
  • Add tests for systemd services and/or the daemon wrapper scripts -> #68167
  • Prevent wrong arguments exiting the service with success -> #68167
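A minimal sketch of how a test could catch the "help text with exit code 0" case from the last two points. The function names `printed_usage` and `fail_on_usage` are made up for this example, and treating the "Usage: APPLICATION COMMAND" line from the journal as the marker is an assumption, not an openQA or Mojolicious API.

```shell
#!/bin/sh
# Succeeds if the captured output contains a usage/help banner.
# Assumption: the "Usage: APPLICATION COMMAND" line seen in the journal
# is a good enough marker for "the arguments were not understood".
printed_usage() {
    printf '%s\n' "$1" | grep -q '^Usage: APPLICATION COMMAND'
}

# Runs the given command, echoes its output and fails if usage text was
# printed, even when the command itself exited with 0.
fail_on_usage() {
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"
    if printed_usage "$output"; then
        echo "error: command printed usage instead of starting" >&2
        return 1
    fi
    return "$status"
}
```

In a test this could wrap a short-lived invocation of the wrapper script with the same arguments as in the systemd unit. For a long-running worker the call would block until the process ends, so a real test would need some way to stop it again; that part is left open here.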
Actions #1

Updated by okurz almost 4 years ago

  • Description updated (diff)

After merging https://github.com/os-autoinst/openQA/pull/3182 and waiting for packages to be built in https://build.opensuse.org/package/show/devel:openQA/openQA I triggered on o3:

for i in openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "(transactional-update -n dup || zypper -n dup) && reboot" ; done

EDIT: I forgot aarch64 and fixed that manually. The above command should be:

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "(transactional-update -n dup || zypper -n dup) && reboot" ; done
Actions #2

Updated by okurz almost 4 years ago

  • Description updated (diff)
Actions #3

Updated by okurz almost 4 years ago

For osd workers I applied the fix with

sudo salt -l error -C 'G@roles:worker' cmd.run 'zypper -n in openQA-worker && systemctl start openqa-worker-cacheservice-minion.service'

I sent a message in reply to the email announcement on openqa@suse.de. https://openqa.suse.de/tests looks sane; tests have been picked up.

Actions #4

Updated by okurz almost 4 years ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback

Created new ticket #68167 for the two points "Add tests for systemd services and/or the daemon wrapper scripts" and "Prevent wrong arguments exiting the service with success".

Actions #5

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

No more problems reported by users, tests are running fine on both o3 and osd, and there are no new grafana alerts. Actually https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=queries&orgId=1&panelId=17&fullscreen&edit&refresh=30s&from=now-6h&to=now looks good again. This seems to be related to what I did, but https://openqa.suse.de/minion/workers (still?) shows no "active" and no "inactive" workers.
