action #43934

Manage o3 infrastructure with salt again

Added by okurz about 2 years ago. Updated about 2 months ago.

Status: Workable
Priority: Low
Assignee: -
Target version:
Start date: 2018-11-15
Due date:
% Done: 0%
Estimated time:

Description

Observation

See #43823#note-1. Previously we had a salt-minion on each worker; even though no salt recipes were used, we at least used salt for structured remote execution ;)

Expected result

As salt was already there, is the preferred system management solution, and should be extended to have full recipes, we should again have a salt-minion available on all the workers.

To be covered for o3 by system management, e.g. salt states:

  • aarch64 irqbalance workaround #53573
  • hugepages workaround #53234
  • ppc kvm permissions #25170

Related issues

Related to openQA Infrastructure - action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts (Resolved, 2018-11-15)

Related to openQA Infrastructure - action #44066: merge the two osd salt git repos (Rejected, 2018-11-20)

Related to openQA Infrastructure - action #53573: Failed service "irqbalance" on aarch64.o.o (Resolved, 2019-06-30)

Copied from openQA Infrastructure - action #43823: o3 workers immediately incompleting all jobs, caching service can not be reached (Resolved, 2018-11-15)

History

#1 Updated by okurz about 2 years ago

  • Copied from action #43823: o3 workers immediately incompleting all jobs, caching service can not be reached added

#2 Updated by okurz about 2 years ago

  • Copied to action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added

#3 Updated by RBrownSUSE about 2 years ago

Indeed, this is intentional, but temporarily intentional

Given Salt's habit of exploding spectacularly when the master is not updated but the minions are, and given we patched the minions first and they now auto update, it would be suicidal to have a master running on Leap 42.3 o3 talking to minions running on Leap 15.0 workers.

When o3 is also Leap 15 and updated at least as frequently as the workers, then this makes sense, thanks for tracking the item :)

#4 Updated by okurz about 2 years ago

I see, so I created a new ticket #43976 to cover that.

#5 Updated by nicksinger about 2 years ago

  • Status changed from New to Blocked

#6 Updated by nicksinger about 2 years ago

  • Copied to deleted (action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts)

#7 Updated by nicksinger about 2 years ago

  • Blocks action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added

#8 Updated by okurz over 1 year ago

  • Blocks deleted (action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts)

#9 Updated by okurz over 1 year ago

  • Related to action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added

#10 Updated by okurz over 1 year ago

  • Status changed from Blocked to Workable

I think nicksinger got the blocks/blocked relation upside down; however, I do not see this ticket as strongly blocking or blocked. A strong relationship, sure.

#11 Updated by okurz over 1 year ago

  • Subject changed from salt is gone from o3 workers? to Manage o3 infrastructure with salt again
  • Priority changed from Normal to Low

Because salt is not currently used within the o3 infrastructure, and because of error messages within the journal on o3, for now I disabled the salt master on o3 as well with "systemctl disable --now salt-master" to prevent the error message "Exception during resolving address: [Errno 1] Unknown host".

#12 Updated by okurz over 1 year ago

  • Related to action #44066: merge the two osd salt git repos added

#13 Updated by okurz over 1 year ago

  • Blocks action #53573: Failed service "irqbalance" on aarch64.o.o added

#14 Updated by okurz about 1 year ago

  • Blocks deleted (action #53573: Failed service "irqbalance" on aarch64.o.o)

#15 Updated by okurz about 1 year ago

  • Related to action #53573: Failed service "irqbalance" on aarch64.o.o added

#16 Updated by okurz about 1 year ago

  • Description updated (diff)

#17 Updated by okurz 12 months ago

had a chat with lrupp/kl_eisbaer: He made me aware of https://build.opensuse.org/package/show/OBS:Server:Unstable/OBS-WorkerOnly which provides images that are used by OBS workers.

The images are loaded via PXE; the PXE config points to an image, and this image path is a symlink. This allows easily switching from one image to another if something is identified as broken.

[04/02/2020 10:20:36] <kl_eisbaer> the most funny part is the one that adjusts the worker after the PXE boot.
[04/02/2020 10:20:57] <kl_eisbaer> here we use a script in the init phase, that downloads files/settings from the server
[04/02/2020 10:21:22] <kl_eisbaer> with this, we can adjust the configuration of the worker (how many parallel builds, how much disk space, etc)
[04/02/2020 10:21:44] <kl_eisbaer> okurz: if you are interested, I can give you a short introduction into our setup
[04/02/2020 10:22:36] <kl_eisbaer> ...which even allows us to move workers between OBS/IBS via script since we have access to the switches :-)
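The symlink-based image switch described above could look roughly like this. All paths and image names below are invented for illustration; the actual OBS worker setup is not documented here:

```shell
# Hypothetical sketch: the PXE config always points at "current.img",
# so switching the boot image for all workers is an atomic symlink swap.
# (Paths and image names are placeholders, not the real OBS layout.)
ln -sfn obs-worker-20200204.img /srv/tftpboot/images/current.img

# Rolling back to a previous, known-good image is the same operation:
ln -sfn obs-worker-20200101.img /srv/tftpboot/images/current.img
```

The point of the indirection is that nothing in the PXE config itself ever changes; only the symlink target does.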

#18 Updated by okurz 11 months ago

My current proposal is the following:

  • Ensure salt-minion on all o3 workers
  • Ensure salt-master on o3
  • Ensure workers are connected to o3 and salt key is accepted
  • Move gitlab.suse.de/openqa/salt-states-openqa to github, e.g. in https://github.com/os-autoinst scope, and create back-mirror into salt-states repo or get rid of it completely

Does anyone see problems with this approach?
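The first three steps of the proposal could be sketched as follows, assuming a plain zypper-based worker setup; the master hostname is a placeholder, not the real o3 address:

```shell
# On each o3 worker: install the minion and point it at the master.
# "o3.example.org" is a placeholder hostname.
zypper --non-interactive install salt-minion
echo "master: o3.example.org" > /etc/salt/minion.d/master.conf
systemctl enable --now salt-minion

# On o3 itself: run the master and accept the workers' keys.
zypper --non-interactive install salt-master
systemctl enable --now salt-master
salt-key --list unaccepted
salt-key --accept-all   # or accept keys one by one after verifying fingerprints
```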

#19 Updated by okurz 3 months ago

  • Target version set to Ready

#20 Updated by nicksinger 2 months ago

RBrownSUSE wrote:

Indeed, this is intentional, but temporarily intentional

Given Salt's habit of exploding spectacularly when the master is not updated but the minions are, and given we patched the minions first and they now auto update, it would be suicidal to have a master running on leap 42.3 o3 talking to minions running on Leap 15.0 workers.

When o3 is also Leap 15 and updated at least as frequently as the workers, then this makes sense, thanks for tracking the item :)

okurz this still applies somewhat. While o3 is on 15.2 in the meantime, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?
  • If salt explodes that spectacularly with non-matching versions, should we maybe look into something like ansible?
  • Should we at least deploy an ssh-key onto ariel which can access all workers over ssh and install something like pssh (https://linux.die.net/man/1/pssh)?
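The pssh idea from the last bullet could look like this in practice. The hosts file and worker names are hypothetical:

```shell
# Hypothetical sketch: a hosts file listing all o3 workers, plus an
# ssh key on ariel that root on the workers accepts.
cat > ~/o3-workers.txt <<'EOF'
worker1.example.org
worker2.example.org
EOF

# Run a command on all workers in parallel;
# -h: hosts file, -l: remote user, -i: print each host's output inline.
pssh -h ~/o3-workers.txt -l root -i -- systemctl is-active salt-minion
```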

#21 Updated by RBrownSUSE 2 months ago

nicksinger wrote:

okurz this still applies somewhat. While o3 is on 15.2 in the meantime, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?

Saltstack meanwhile supports transactional systems - https://github.com/openSUSE/salt/pull/271

  • If salt explodes that spectacularly with non-matching versions, should we maybe look into something like ansible?

Salt can run masterless, in which case the versions are unrelated
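A minimal sketch of the masterless mode mentioned here, assuming states are synced to each host's /srv/salt e.g. via git; no master process is involved, so there is no version pairing to break:

```shell
# Masterless salt: each host applies states from its local file tree,
# so master and minion versions never have to match.
salt-call --local state.apply        # apply the full highstate from /srv/salt
salt-call --local cmd.run 'uptime'   # one-off local command, no master needed
```

Note that masterless mode runs per host; it does not by itself give one central point for executing a command across all workers.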

#22 Updated by nicksinger 2 months ago

RBrownSUSE wrote:

nicksinger wrote:

okurz this still applies somewhat. While o3 is on 15.2 in the meantime, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?

Saltstack meanwhile supports transactional systems - https://github.com/openSUSE/salt/pull/271

kind of https://github.com/saltstack/salt/pull/58520 ;)
It also doesn't solve the problem of o3 being upgraded manually, so version differences can still happen. Masterless salt is an interesting point you have there. Can it cover our (current) main use-case of executing commands on multiple hosts?

#23 Updated by okurz about 2 months ago

I would not be concerned about version differences until I see that failing. Running salt-master (again) on o3 and a salt-minion on each worker, accepting salt keys, and then simply using it for distributed command execution, e.g. cmd.run, is a good start.
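Once a master runs on o3 and the minion keys are accepted, the distributed command execution described here is a one-liner; '*' targets all accepted minions:

```shell
salt '*' test.ping                               # quick connectivity check first
salt '*' cmd.run 'uptime'                        # run a command on every worker
salt 'openqaworker*' cmd.run 'systemctl is-active salt-minion'  # glob targeting; minion-name pattern is an example
```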

#24 Updated by okurz about 2 months ago

  • Estimated time set to 80142.00 h

#25 Updated by okurz about 2 months ago

  • Estimated time deleted (80142.00 h)
