coordination #40199

closed

[EPIC] Better rollback capabilities of (worker) deployments

Added by okurz over 6 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-08-23
Due date:
% Done: 0%
Estimated time:

Description

As an outcome of #39743 we learned that complete deployment rollbacks for the whole infrastructure would be nice (including openQA packages, database and test settings, and system packages on both the web UI and the workers), but there will always be factors changing outside our control.


Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out (Resolved, okurz, 2018-08-15)

Actions #1

Updated by okurz over 6 years ago

  • Related to action #39743: [o3][tools] o3 unusable, often responds with 504 Gateway Time-out added
Actions #2

Updated by coolo over 6 years ago

I have my doubts that this is a reasonable request. Wishful thinking might lead to this, but IMO you need quite a deployment team to roll something like this. And on top of that: I don't think the deployment strategy is part of the 'openQA Project'.

Actions #3

Updated by coolo over 6 years ago

  • Subject changed from [tools] Better rollback capabilities of deployments to [EIPC] Better rollback capabilities of (worker) deployments
  • Target version set to future

For the webui I have no good idea - but for the workers we could complement salt with workers reinstalling a defined state on boot. This is still a huge task - and work intensive to maintain, so I'm not really sure we should invest there.

Actions #4

Updated by okurz over 6 years ago

  • Subject changed from [EIPC] Better rollback capabilities of (worker) deployments to [EPIC] Better rollback capabilities of (worker) deployments

I guess you meant "EPIC" instead of "EIPC" ;) You are losing some part of the original idea when you restrict it with "(worker)" and do not cover the web UI part anymore. I am with you that this is no easy "let's hack some perl" task, but I still see it as feasible. And isn't this basically also a business case we sell to customers? At least one feasible (albeit maybe not the best) approach to reach the goal of the original ticket description would be:

  • Use btrfs with snapshots on / for each machine (done for workers, missing for webui)
  • Only ever upgrade the webui together with a full database dump saved just before the upgrade (script or salt should work)
  • Train dry-runs with all involved admins of the "worst case scenarios" to have them less scared and reduce the recovery time in case of emergencies
  • Optional: Save the RPM files used for installation on both webui and workers elsewhere so we can go back to them, or use automatic maintenance requests for tested packages based on openQA-in-openQA, which would make sure that older package versions are saved "automatically". The openQA updates are probably too heavy for the maintenance workflow, though.
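
To make the first two points more concrete, here is a minimal sketch of a pre-upgrade step for the webui host. It only builds the commands and does not run them. The database name `openqa`, the backup directory and the function name `pre_upgrade_commands` are assumptions for illustration and not taken from any existing script or salt state.

```python
import shlex
from datetime import datetime, timezone
from pathlib import Path


def pre_upgrade_commands(backup_dir, db_name="openqa", now=None):
    """Return the commands to run just before a web UI upgrade.

    Nothing is executed here; the caller decides whether to run them.
    db_name and backup_dir are hypothetical values for illustration.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    dump_path = Path(backup_dir) / f"{db_name}-{stamp}.dump"
    return [
        # btrfs snapshot of / via snapper, as a rollback point for system packages
        ["snapper", "create", "--description", f"pre-upgrade {stamp}"],
        # full database dump in PostgreSQL's custom format (restorable with pg_restore)
        ["pg_dump", "--format=custom", f"--file={dump_path}", db_name],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in pre_upgrade_commands("/var/backups/openqa"):
        print(shlex.join(cmd))
```

Printing the commands first fits the "train dry-runs" point: admins can read exactly what would happen before anything touches the machine.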
Actions #5

Updated by okurz over 5 years ago

  • Category changed from 168 to Feature requests
Actions #6

Updated by okurz over 5 years ago

So for o3, what works quite well is to have transactional-server worker hosts and, for the o3 webui host, to keep packages from the devel:openQA repos with a simple keeppackages=1 in the .repo files. We commonly save a database dump when we update the webui host, so that part is also covered. We also have automation for the complete o3 upgrade and are getting nearer to that on osd as well.
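
As an illustration of the keeppackages=1 setting, here is a small sketch that sets it in the content of a zypper .repo file (INI format). The helper name `enable_keeppackages` and the sample repo section are hypothetical; this is not how the setting was actually applied on o3.

```python
import configparser
import io


def enable_keeppackages(repo_text):
    """Return .repo file content with keeppackages=1 set in every section."""
    # Only "=" is treated as delimiter so values such as "devel:openQA" stay intact.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    for section in parser.sections():
        parser[section]["keeppackages"] = "1"
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical repo definition, for illustration only.
    sample = "[devel_openQA]\nname=devel:openQA\nenabled=1\nautorefresh=1\n"
    print(enable_keeppackages(sample))
```

With the setting enabled, zypper keeps the downloaded RPMs in its package cache, so an older version is still available locally if a downgrade is needed.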

Actions #7

Updated by okurz over 5 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz
Actions #8

Updated by okurz about 5 years ago

  • Status changed from Feedback to Resolved

So we have a GitLab CI pipeline with rollback possibilities for OSD … I guess this is as good as it gets for now.

Actions #9

Updated by szarate over 4 years ago

  • Tracker changed from action to coordination