coordination #43934

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[epic] Manage o3 infrastructure with salt again

Added by okurz about 6 years ago. Updated almost 2 years ago.

Status: Blocked
Priority: Low
Assignee: -
Category: Organisational
Target version: QA (public, currently private due to #173521) - future
Start date: 2021-03-16
Due date: -
% Done: 33%
Estimated time: (Total: 0.00 h)

Description

Observation

See #43823#note-1. Previously we had a salt-minion on each worker even though no salt recipes were used; at least we used salt for structured remote execution ;)

Expected result

As salt was already there, is the preferred system management solution, and should be extended with full recipes, we should have a salt-minion available on all the workers as well.

To be covered for o3 in system management, e.g. salt states:

  • aarch64 irqbalance workaround #53573
  • hugepages workaround #53234
  • ppc kvm permissions #25170

Subtasks 3 (2 open, 1 closed)

action #90164: Make gitlab.suse.de/openqa/salt-states-openqa public (Resolved, okurz, 2021-03-16)
action #90167: Setup initial salt infrastructure for remote management within o3 (New, 2021-03-16)
action #91332: Allow to contribute to salt-states-openqa over github (Workable, 2021-04-19)

Related issues 4 (0 open, 4 closed)

Related to openQA Infrastructure (public) - action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts (Resolved, okurz, 2018-11-15)
Related to openQA Infrastructure (public) - action #44066: merge the two osd salt git repos (Rejected, okurz, 2018-11-20)
Related to openQA Infrastructure (public) - action #53573: Failed service "irqbalance" on aarch64.o.o (Resolved, okurz, 2019-06-30)
Copied from openQA Infrastructure (public) - action #43823: o3 workers immediately incompleting all jobs, caching service can not be reached (Resolved, okurz, 2018-11-15)

Actions #1

Updated by okurz about 6 years ago

  • Copied from action #43823: o3 workers immediately incompleting all jobs, caching service can not be reached added
Actions #2

Updated by okurz about 6 years ago

  • Copied to action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added
Actions #3

Updated by RBrownSUSE about 6 years ago

Indeed, this is intentional, but temporarily intentional.

Given Salt's habit of exploding spectacularly when the master is not updated but the minions are, and given we patched the minions first and they now auto-update, it would be suicidal to have a master running on Leap 42.3 o3 talking to minions running on Leap 15.0 workers.

When o3 is also on Leap 15 and updated at least as frequently as the workers, then this makes sense, thanks for tracking the item :)

Actions #4

Updated by okurz about 6 years ago

I see, so I created a new ticket #43976 to cover that.

Actions #5

Updated by nicksinger about 6 years ago

  • Status changed from New to Blocked
Actions #6

Updated by nicksinger about 6 years ago

  • Copied to deleted (action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts)
Actions #7

Updated by nicksinger about 6 years ago

  • Blocks action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added
Actions #8

Updated by okurz over 5 years ago

  • Blocks deleted (action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts)
Actions #9

Updated by okurz over 5 years ago

  • Related to action #43937: align o3 workers (done: "imagetester" and) "power8" with the others which are currently "transactional-update" hosts added
Actions #10

Updated by okurz over 5 years ago

  • Status changed from Blocked to Workable

I think @nicksinger got it upside down with the blocks/blocked relation; however, I do not see this ticket as strongly blocking or blocked. A strong relationship, sure.

Actions #11

Updated by okurz over 5 years ago

  • Subject changed from salt is gone from o3 workers? to Manage o3 infrastructure with salt again
  • Priority changed from Normal to Low

Because salt is not currently used within the o3 infrastructure, and because of error messages within the journal on o3, I have for now disabled the salt master on o3 as well with systemctl disable --now salt-master, to prevent the error message "Exception during resolving address: [Errno 1] Unknown host".

Actions #12

Updated by okurz about 5 years ago

  • Related to action #44066: merge the two osd salt git repos added
Actions #13

Updated by okurz about 5 years ago

  • Blocks action #53573: Failed service "irqbalance" on aarch64.o.o added
Actions #14

Updated by okurz about 5 years ago

  • Blocks deleted (action #53573: Failed service "irqbalance" on aarch64.o.o)
Actions #15

Updated by okurz about 5 years ago

  • Related to action #53573: Failed service "irqbalance" on aarch64.o.o added
Actions #16

Updated by okurz about 5 years ago

  • Description updated (diff)
Actions #17

Updated by okurz almost 5 years ago

Had a chat with lrupp/kl_eisbaer: he made me aware of https://build.opensuse.org/package/show/OBS:Server:Unstable/OBS-WorkerOnly, which provides the images used by OBS workers.

The images are loaded by PXE; the PXE config points to an image, and this image path is a symlink. This allows easily switching from one image to another if something is identified as broken.
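
A minimal sketch of that symlink indirection (paths and image names hypothetical, not the actual OBS layout):

    # the PXE config references a stable path, which is a symlink
    # into a versioned image directory:
    ls -l /srv/tftpboot/images/current
    # current -> obs-worker-20200401
    # switching all newly booting workers to another image is one atomic step:
    ln -sfn obs-worker-20200315 /srv/tftpboot/images/current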

[04/02/2020 10:20:36] <kl_eisbaer> the most funny part is the one that adjusts the worker after the PXE boot.
[04/02/2020 10:20:57] <kl_eisbaer> here we use a script in the init phase, that downloads files/settings from the server
[04/02/2020 10:21:22] <kl_eisbaer> with this, we can adjust the configuration of the worker (how many parallel builds, how much disk space, etc)
[04/02/2020 10:21:44] <kl_eisbaer> okurz: if you are interested, I can give you a short introduction into our setup
[04/02/2020 10:22:36] <kl_eisbaer> ...which even allows us to move workers between OBS/IBS via script since we have access to the switches :-)
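
As a rough illustration of that init-phase step (URL, file paths and variable names hypothetical, not the actual OBS script):

    # fetch per-host settings from the server during early boot ...
    curl -fsSL "http://obs-admin.example.org/worker-conf/$(hostname).conf" -o /etc/sysconfig/obs-worker
    # ... and source them before starting the worker,
    # e.g. number of parallel builds, disk space
    . /etc/sysconfig/obs-worker
    systemctl start obsworker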
Actions #18

Updated by okurz almost 5 years ago

My current proposal is the following:

  • Ensure salt-minion on all o3 workers
  • Ensure salt-master on o3
  • Ensure workers are connected to o3 and salt key is accepted
  • Move gitlab.suse.de/openqa/salt-states-openqa to github, e.g. into the https://github.com/os-autoinst scope, and create a back-mirror into the salt-states repo or get rid of it completely

Does anyone see problems with this approach?
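
A minimal sketch of these steps, assuming openSUSE hosts and the master running on o3 itself (hostname illustrative):

    # on o3 (salt master)
    zypper -n in salt-master && systemctl enable --now salt-master
    # on each worker (salt minion), pointing it at the master
    zypper -n in salt-minion
    echo 'master: ariel' > /etc/salt/minion.d/master.conf
    systemctl enable --now salt-minion
    # back on o3: list pending minion keys, then accept them
    salt-key -L
    salt-key -A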

Actions #19

Updated by okurz about 4 years ago

  • Target version set to Ready
Actions #20

Updated by nicksinger about 4 years ago

RBrownSUSE wrote:

Indeed, this is intentional, but temporarily intentional.

Given Salt's habit of exploding spectacularly when the master is not updated but the minions are, and given we patched the minions first and they now auto-update, it would be suicidal to have a master running on Leap 42.3 o3 talking to minions running on Leap 15.0 workers.

When o3 is also on Leap 15 and updated at least as frequently as the workers, then this makes sense, thanks for tracking the item :)

@okurz this still applies somewhat. While o3 is on 15.2 by now, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?
  • If salt explodes that spectacularly with non-matching versions, should we maybe look into something like Ansible?
  • Should we at least deploy an ssh key onto ariel which can access all workers over ssh and install something like pssh (https://linux.die.net/man/1/pssh)?
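
A minimal sketch of the pssh variant, assuming such a root ssh key is already distributed (host file path illustrative):

    # run one command on all o3 workers in parallel over plain ssh;
    # /etc/o3-workers.txt contains one worker hostname per line
    pssh -h /etc/o3-workers.txt -l root -i 'systemctl is-active openqa-worker@1'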
Actions #21

Updated by RBrownSUSE about 4 years ago

nicksinger wrote:

@okurz this still applies somewhat. While o3 is on 15.2 by now, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?

Saltstack supports transactional systems by now: https://github.com/openSUSE/salt/pull/271

  • If salt explodes that spectacularly with non-matching versions, should we maybe look into something like Ansible?

Salt can run masterless, in which case the versions are unrelated.
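
A minimal masterless invocation, assuming the states repo is checked out locally on each host (path illustrative):

    # no master involved: the host applies states from a local checkout,
    # so master/minion version mismatches cannot occur
    salt-call --local --file-root=/srv/salt-states-openqa state.apply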

Actions #22

Updated by nicksinger about 4 years ago

RBrownSUSE wrote:

nicksinger wrote:

@okurz this still applies somewhat. While o3 is on 15.2 by now, it still needs manual updates. This raises a few points from my side:

  • Isn't the topic "install salt" blocked by the migration of o3 onto transactional servers?

Saltstack supports transactional systems by now: https://github.com/openSUSE/salt/pull/271

Kind of: https://github.com/saltstack/salt/pull/58520 ;)
It also doesn't solve the problem of o3 being upgraded manually, so version differences can still happen. Masterless salt is an interesting point you have there. Can it cover our (current) main use case: executing commands on multiple hosts?

Actions #23

Updated by okurz about 4 years ago

I would not be concerned with version differences until I see them actually causing failures. Running salt-master (again) on o3 and a salt-minion on each worker, accepting the salt keys, and then simply using it for distributed command execution, e.g. cmd.run, is a good start.
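
For example (target patterns illustrative):

    # run the same command on all connected minions at once
    salt '*' cmd.run 'uptime'
    # or target a subset of workers by hostname glob
    salt 'openqaworker*' cmd.run 'systemctl restart openqa-worker@1.service'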

Actions #24

Updated by okurz about 4 years ago

  • Estimated time set to 80142.00 h
Actions #25

Updated by okurz about 4 years ago

  • Estimated time deleted (80142.00 h)
Actions #26

Updated by okurz over 3 years ago

  • Tags set to salt, system management, o3, osd, open source, infrastucture
  • Tracker changed from action to coordination
  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Subject changed from Manage o3 infrastructure with salt again to [epic] Manage o3 infrastructure with salt again
  • Category set to Organisational
  • Assignee set to okurz
  • Parent task set to #80142
Actions #27

Updated by okurz over 3 years ago

  • Status changed from Workable to Blocked

Created two specific subtasks to make picking up easier :)

Actions #28

Updated by okurz over 3 years ago

  • Target version changed from Ready to future

With the two subtasks in "future", we can also move this epic there for now.

Actions #29

Updated by okurz almost 2 years ago

  • Tags changed from salt, system management, o3, osd, open source, infrastucture to salt, system management, o3, osd, open source, infrastructure, infra
Actions #30

Updated by okurz almost 2 years ago

  • Tags changed from salt, system management, o3, osd, open source, infrastructure, infra to salt, system management, o3, osd, open source, infra