coordination #80142

open

[saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Added by okurz almost 4 years ago. Updated over 1 year ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-09-26
Due date:
% Done: 73%
Estimated time: (Total: 2.00 h)

Description

Motivation

Container-based deployment has become an industry standard. We should fully support it and feature it prominently, both for simple single-instance setups run by individuals and for multi-node setups in clusters.

Also, single, production instances of the webui can cause longer downtimes and make OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Load balancing can help here, as can switch-over deployments that make testing, staging, etc. easier.

Acceptance criteria

  • AC1: an openQA infrastructure deployed on Kubernetes is part of our continuous testing setup
  • AC2: documentation exists on how to set up redundant, load-balancing infrastructures
  • AC3: Support for openQA on container management frameworks is prominently presented
  • AC4: documentation exists for simple, personal single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

Suggestions

Based on the spike conducted in #69355 we can streamline the support, add documentation, introduce proper testing, consider running that setup as part of our DevOps structure, etc.

The state of the art is Kubernetes (k8s), so we should aim for that. A "docker compose" file may be a good intermediate step, followed by k8s with a Helm chart and potentially a setup based on GitLab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment


Subtasks 154 (44 open, 110 closed)

action #41600: fallback mechanism for apache, e.g. on osd | New | 2018-09-26
coordination #43706: [epic] Generate "download&use" docker image of openQA for SUSE QA | Resolved | okurz | 2018-11-13
action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui | Resolved | ilausuch | 2018-11-13
action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers | Resolved | ilausuch | 2018-11-13
action #43718: Docker image for webui and workers are versioned and uploaded to obs registry | Resolved | ilausuch | 2018-11-13
action #80518: provide container images for aarch64 | Resolved | okurz | 2020-11-27
action #80520: Automatic tests for our openQA containers - webUI only | Resolved | ilausuch | 2020-11-27
action #80534: publication+demo for updated openQA containers | Resolved | okurz | 2020-11-27
action #80682: Automatic tests for our openQA containers - worker only | Resolved | ilausuch | 2020-11-27
action #80684: Automatic tests for our openQA containers - worker+webui connection | Resolved | okurz | 2020-11-27
action #81118: automatic container tests for os-autoinst | Resolved | okurz | 2020-12-16
coordination #43934: [epic] Manage o3 infrastructure with salt again | Blocked | okurz | 2021-03-16
action #90164: Make gitlab.suse.de/openqa/salt-states-openqa public | Resolved | okurz | 2021-03-16
action #90167: Setup initial salt infrastructure for remote management within o3 | New | 2021-03-16
action #91332: Allow to contribute to salt-states-openqa over github | Workable | 2021-04-19
action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD | New | 2019-08-08
coordination #55364: [epic] Let's make codecov reports reliable | Resolved | okurz | 2020-09-24
action #71857: flaky/unstable/sporadic test coverage from t/34-developer_mode-unit.t | Resolved | livdywan | 2020-09-24
action #78043: unstable/flaky/inconsistent statement coverage in t/lib/OpenQA/SeleniumTest | Resolved | okurz | 2020-11-16
action #80268: Fix flaky coverage - t/05-scheduler-full.t | Resolved | okurz | 2020-11-24
action #80274: Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M | Resolved | kraih | 2020-11-24
action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm | Resolved | okurz | 2020-11-24
action #88594: codecov reports for individual files yield "forbidden" and files can not be shown | Resolved | okurz | 2021-02-15
action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t | Resolved | kraih
action #69355: [spike] redundant/load-balancing webui deployments of openQA | Resolved | ilausuch | 2020-07-25
openQA Infrastructure - action #70978: automatic reboots on o3 to activate new kernel versions | Resolved | okurz | 2020-09-04
openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised | Resolved | okurz | 2020-09-08
openQA Infrastructure - action #73174: [osd][alert] Job age (scheduled) (median) alert | Resolved | okurz | 2020-10-09
action #73447: POC: Create openQA Web Application container image (feature) | Resolved | ilausuch | 2020-10-16
action #73450: POC: Create openQA worker container image (feature) | Resolved | ilausuch | 2020-10-16
openQA Infrastructure - action #76786: Configure static hostnames with salt for all salt nodes | Resolved | okurz
openQA Infrastructure - action #76876: Find a better (automated) way to inform infra about hanging (arm) workers | Resolved | livdywan | 2020-11-02
action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA | Resolved | mkittler
openQA Infrastructure - coordination #78206: [epic] 2020-11-18 nbg power outage aftermath | Resolved | okurz | 2020-11-27
openQA Infrastructure - action #80540: idea: Conduct "power outage drills", e.g. once every half-year? | Rejected | okurz | 2020-11-27
openQA Infrastructure - action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) | Resolved | okurz | 2020-11-27
openQA Infrastructure - action #80544: Ensure that IPMI for powerqaworker-qam works reliably | Resolved | okurz
openQA Infrastructure - action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage | Resolved | okurz | 2021-05-05
openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline | Resolved | okurz | 2020-11-19
action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) | Resolved | mkittler | 2021-01-18
openQA Infrastructure - action #78438: openQA webui entry "Assigned worker" shows ip instead of names as formerly - manual cleanup work | Resolved | okurz | 2020-11-05
openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup | Resolved | okurz
coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs | Resolved | okurz | 2020-12-09
action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs | Resolved | mkittler | 2020-12-09
action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable | Resolved | mkittler | 2020-12-11
openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates | Resolved | okurz | 2021-01-08
action #89200: Switch OSD deployment to two-daily deployment | Resolved | mkittler | 2021-02-26
action #90152: module results missing on quick job (on auto-restarting worker) | Resolved | mkittler | 2021-03-16
action #104178: Increase OSD deployment rate from every second day to daily | Resolved | okurz | 2021-12-20
action #104841: Prevent empty changelog messages from osd-deployment when there are no changes size:M | Resolved | mkittler | 2022-01-12
action #105379: Continuous deployment of o3 workers - one worker first size:M | Resolved | mkittler | 2022-01-24
action #105885: Continuous deployment of o3 workers - all the other o3 workers size:M | Resolved | mkittler
action #111028: Continuous update of o3 webUI | Resolved | okurz | 2022-05-12
action #111377: Continuous deployment of osd workers - similar as on o3 size:M | Rejected | okurz | 2022-05-20
coordination #81060: [epic] openQA web UI in kubernetes | Blocked | okurz | 2022-05-02
action #110524: [timeboxed:20h][spike] openQA proof-of-concept within kubernetes size:M | Resolved | jbaier_cz | 2022-05-02
action #110725: Unexpected behavior for cache service under k3s when the CACHE_MIN_FREE_PERCENTAGE is set size:M | Workable | 2022-05-06
action #111323: Simplified web proxy setup (remove path rewrite) | Resolved | jbaier_cz | 2022-05-19
action #111329: openQA within kubernetes with tested helm charts size:M | Resolved | jbaier_cz | 2022-05-19
action #113402: Handle error messages in existing helm chart CI test logs | New | 2022-07-08
action #87898: Add grafana alert for "broken workers" as reported by openQA | Resolved | mkittler | 2021-01-18
action #88187: Set the addresses in the "internal clients" configurable | Resolved | mkittler | 2021-01-25
coordination #89020: [epic] Support for multiple authentication providers | Workable | 2021-04-09
action #90929: get OAuth2 to work with salsa.debian.org (gitlab) | Resolved | mkittler | 2021-04-09
action #89116: Enable openQA cloud IDE workflow out of the box | New | 2021-02-25
action #89755: container: Fix missing shared directories and its permissions | Resolved | 2021-03-09
coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup | New | 2021-03-09
action #89719: docker-compose up fails on master | Resolved | ilausuch | 2021-03-09
action #89722: Need automatic check for docker-compose | Resolved | ilausuch | 2021-03-09
action #89731: containers: The deploy using docker-compose is not stable and eventually fails | Resolved | ilausuch | 2021-03-09
action #89749: containers: Allow to the user to choose the source of images in the docker-compose | New | 2021-03-09
action #89752: containers: Add a worker service as part of the docker-compose | Resolved | mkittler | 2021-03-09
coordination #90359: [epic] Customizable worker engine | Workable | 2021-03-19
action #90362: Allow to customize worker engine by configuration | Resolved | okurz | 2021-03-19
coordination #92037: [epic] Extensible openQA: Support for external openQA plugins | Workable | 2021-04-30
action #92040: Support for external openQA plugins in home directory | New | 2021-04-30
coordination #92575: [epic] feature switch capabilities | New | 2021-05-12
action #92578: infrastructure-instance wide feature switches | New | 2021-05-12
coordination #92854: [epic] limit overload of openQA webUI by heavy requests | Blocked | okurz | 2021-06-12
action #93925: Optimize SQL on /tests/overview | Resolved | tinita | 2021-06-12
action #94354: Optimize /dashboard_build_results and /group_overview/* pages | Resolved | tinita | 2021-06-21
action #94667: Optimize products/machines/test_suites API calls size:M | Resolved | kraih | 2021-06-24
action #94705: Monitor number of SQL queries in grafana | New | 2021-06-25
action #94753: Try out munin on o3 | Resolved | tinita | 2021-06-25
action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API | New | 2022-05-30
action #111770: Limit finished tests on /tests, but query configurable and show complete number of jobs size:S | Resolved | mkittler | 2022-05-30
action #113189: Research where we need limits size:S | Resolved | mkittler | 2022-07-04
action #114421: Add a limit where it makes sense after we have it for /tests, query configurable size:M | Resolved | robert.richardson
action #119428: Ensure users can get all the data for limited queries, e.g. with pagination | New | 2022-11-22
action #120841: Add pagination for GET /api/v1/assets size:M | Resolved | kraih | 2022-11-22
action #121048: Add pagination for GET /api/v1/bugs | Resolved | kraih | 2022-11-22
action #121099: Add pagination for GET /api/v1/jobs/<job_id:num>/comments | New
action #121102: Add pagination for GET /api/v1/jobs size:M | Resolved | kraih
action #121105: Add pagination for GET /api/v1/test_suites, GET /api/v1/test_suites/:id, GET /api/v1/machines, GET api/v1/machines/:id, GET /api/v1/products, GET /api/v1/products:id size:M | Resolved | kraih
action #121108: Add pagination for GET /api/v1/workers | Resolved | kraih
action #121111: Add pagination for GET /api/v1/job_templates | New
action #121114: Add pagination for GET /api/v1/job_settings/jobs | New
action #121117: Add pagination for GET /experimental/search | New
action #119431: Inform users e.g. in the webUI if not all results are returned size:M | Workable | 2022-10-26
action #124230: "Off-by-one" error when using the limit=1 openQA API parameter | Resolved | mkittler | 2023-02-09
action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M | Resolved | livdywan | 2022-02-03
action #110677: Investigation page shouldn't involve blocking long-running API routes size:M | Resolved | tinita | 2022-02-03
action #110680: Overview page shouldn't allow long-running requests without limits size:M | Resolved | kraih | 2022-02-03
action #126935: Promote /experimental/jobs/.../status as non-experimental API | New
openQA Infrastructure - action #126941: [spike] Evaluate a move of the osd database to its own VM | New | 2023-03-30
action #129068: Limit the number of uploadable test result steps size:M | Resolved | mkittler | 2023-04-01
action #94255: containers: Improve the speed of the container test in CI | New | 2021-06-18
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert | New | 2020-09-01
action #70774: save_needle Minion tasks fail frequently and needles could get lost | New | 2020-09-01
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly" | New | 2021-12-02
action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M | Resolved | mkittler | 2021-12-02
action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M | Resolved | okurz
action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M | Resolved | mkittler
action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download | New | 2022-03-25
action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule | New | 2022-03-25
action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle | New | 2022-03-25
action #99837: configurable exclusion rules for /influxdb/minion | New | 2021-10-06
action #100503: Identify all "finalize_job_results" failures and handle them (report ticket or fix) | Resolved | livdywan | 2021-10-07
openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S | Resolved | mkittler
coordination #98472: [epic] Scale out: Disaster recovery deployments of existing openQA infrastructures | Resolved | okurz | 2021-09-16
coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors | Resolved | mkittler | 2021-09-21
action #102578: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M | Resolved | okurz | 2021-09-21
coordination #102581: Proof of concept of t/full-stack.t on GitHubActions | Resolved | livdywan | 2021-09-21
action #105429: openQA's fullstack test fails in `shutdown` module | Resolved | mkittler | 2022-01-25
action #106912: Fullstack test can still fail due to `shutdown` module size:M | Resolved | okurz | 2022-02-16
action #107002: Expose fullstack test video from pool directory in CI size:M | Resolved | livdywan | 2022-02-17
action #107941: [sporadic] openQA Fullstack test t/full-stack.t can still fail with "udevadm" log message size:M | Resolved | kraih | 2022-03-07
action #120226: Ensure openQA t/full-stack.t is stable again and not tracked as unstable test size:S | Resolved | kraih | 2022-11-10
coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) | New | 2021-09-30
coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" | Resolved | okurz | 2021-10-01
action #99654: Revisit decision in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 regarding I/O alerts size:S | Resolved | mkittler | 2021-10-01
coordination #99660: [epic] Use more perl signatures in our perl projects | Resolved | okurz | 2021-10-01
action #99663: Use more perl signatures - os-autoinst size:M | Resolved | okurz | 2021-10-01
action #105127: Use more perl signatures - openQA - some simple classes size:S | Resolved | kodymo
openQA Infrastructure - coordination #102266: [epic] o3 ran out of disk space | Resolved | okurz | 2021-12-21
openQA Infrastructure - action #104217: Ask eng infra why thruk.suse.de stopped working | Resolved | okurz | 2021-12-21
coordination #102951: [epic] Better network performance monitoring | Resolved | okurz | 2021-11-24
action #102957: Better network performance monitoring - up-/download speed from cache service, e.g. in log file size:M | Resolved | kraih | 2021-11-24
action #106901: Expose bandwidth data for worker cache via influxdb size:M | Resolved | kraih | 2022-02-16
action #106904: Monitoring for worker specific bandwidth size:M | Resolved | kraih | 2022-02-16
coordination #103122: [epic] Test distribution template for openQA tests to use, e.g. for automatic PR tests in os-autoinst-distri-openQA | Blocked | okurz | 2021-11-26
action #103125: compile tests in os-autoinst-distri-openQA defined in template project size:S | Workable | 2021-11-26
action #107701: [osd] Job detail page fails to load | Resolved | tinita | 2022-02-28
openQA Infrastructure - action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ | Rejected | okurz | 2022-03-22
openQA Infrastructure - action #108743: qa-power8-5-kvm minions alert is heart-broken | Resolved | nicksinger | 2022-03-22
coordination #109659: [epic] More remote workers | Blocked | okurz | 2022-01-17
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M | Resolved | mkittler | 2022-01-17
action #107497: [qe-tools] openqaworker14 (and openqa15) - developer mode fails size:M | Resolved | mkittler | 2022-02-24
openQA Infrastructure - action #110467: Establish reliable tap setup on ow14 | New | 2022-04-29
action #110515: Command export feature for openqa-clone-job size:M | Resolved | mkittler | 2022-04-29
openQA Infrastructure - action #113366: Add three more Prague located OSD workers size:M | Resolved | osukup
openQA Infrastructure - action #124562: Re-install at least one of the new OSD workers located in Prague | New
coordination #109968: [epic] fully containerized openQA on SLE microOS | New | 2022-04-14
action #126938: [spike] Simple load-balancing proxy for select API routes | New | 2023-03-30

Related issues 5 (3 open, 2 closed)

Related to openQA Project - action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed | Workable | 2020-11-26
Related to openQA Project - action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M | Resolved | mkittler | 2021-05-20
Related to openQA Project - action #110497: Minion influxdb data causing unusual download rates size:M | Resolved | mkittler | 2022-05-01
Related to openQA Project - coordination #80150: [epic] Scale out openQA: Easier openQA setup | Blocked | okurz | 2020-11-04
Copied to openQA Project - coordination #121723: [saga][epic] Scale out: Future uses of on-premise cloud or public cloud | New | 2023-11-27

Actions #1

Updated by okurz almost 4 years ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA to [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

I created this saga because the whole "container" feature block was apparently not visible enough for some stakeholders.

Some subtasks are there, ready to be worked on. Setting to "Blocked" by subtasks.

Actions #2

Updated by okurz almost 4 years ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes) to [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
  • Description updated (diff)
Actions #3

Updated by okurz almost 4 years ago

  • Tracker changed from action to coordination
Actions #4

Updated by ilausuch almost 4 years ago

  • Related to action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed added
Actions #5

Updated by livdywan almost 4 years ago

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

Actions #6

Updated by okurz over 3 years ago

  • Description updated (diff)

cdywan wrote:

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

Right. I rephrased it so that the motivation makes clear we mean "single, not-load-balanced production instances" in contrast to "personal instances that are fast to set up".

Actions #7

Updated by livdywan over 3 years ago

  • Related to action #89749: containers: Allow to the user to choose the source of images in the docker-compose added
Actions #8

Updated by okurz over 3 years ago

  • Related to deleted (action #89749: containers: Allow to the user to choose the source of images in the docker-compose)
Actions #9

Updated by okurz over 3 years ago

  • Subject changed from [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes to [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
Actions #10

Updated by mkittler almost 3 years ago

  • Related to action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M added
Actions #11

Updated by okurz over 2 years ago

  • Related to action #110497: Minion influxdb data causing unusual download rates size:M added
Actions #12

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #13

Updated by okurz over 2 years ago

  • Description updated (diff)
Actions #14

Updated by okurz almost 2 years ago

  • Copied to coordination #121723: [saga][epic] Scale out: Future uses of on-premise cloud or public cloud added
Actions #15

Updated by okurz over 1 year ago

  • Target version changed from Ready to future
Actions #16

Updated by okurz over 1 year ago
