action #118660

open

coordination #127040: [epic] Scale out: Easier and automated disaster recovery deployments of openQA

Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M

Added by livdywan about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2022-12-01
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Motivation

We investigated the general feasibility of running an OSD clone in the cloud in #88341, and #100581 is about documenting a setup for workers. Using Terraform would give us a more efficient, generic way to set up OSD (in several ways, depending on backends) without relying on clicking around in e.g. AWS web interfaces.

Acceptance criteria

  • AC1: A main.tf exists that allows setting up OSD in the cloud
  • AC2: One or multiple workers are set up

Suggestions

Out of scope

  • It is OK if the Terraform recipes provide a good baseline for most of the work and the rest is done manually or with quick temporary changes to the recipes as desired, so not everything needs to be automated. If you e.g. use the AWS web UI to add the final touches yourself but note that down, that is completely fine.

Subtasks 1 (0 open, 1 closed)

action #121222: Add ssh support to terraform recipe size:M (Resolved, tinita, 2022-12-01)

Actions #2

Updated by jbaier_cz about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz

I might be interested in looking into this. There is supposed to be a good book about the topic: Terraform Cookbook from Dec 2023 (not a typo).

Actions #3

Updated by openqa_review about 2 years ago

  • Due date set to 2022-11-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by jbaier_cz about 2 years ago

So far I have started with https://github.com/os-autoinst/openQA/pull/4880, a simple draft created according to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#section-71. As next steps, I would like to see whether I can use another Terraform provider (like docker) to test it locally and/or try to log in to the AWS console and test/debug it there.

Actions #5

Updated by jbaier_cz about 2 years ago

  • Due date deleted (2022-11-08)
  • Status changed from In Progress to Workable
  • Assignee deleted (jbaier_cz)

I updated my pull request to include more settings (inspired by the Terraform script created as part of os-autoinst-distri-opensuse). Unfortunately, my AWS credentials seem to be no longer valid, so I am unable to test that for real. Code written for another provider would not be shared with the AWS code either, so that is not a way out. There is, however, an alternative approach we can investigate: the terraform-local and localstack projects should be able to mimic the AWS API locally via docker, and it should also be feasible to run this inside GitHub Actions to test the Terraform code without actually involving AWS.

Due to my upcoming vacation I am unable to finish this ticket in the foreseeable future, so I will unassign myself and set it back to Workable (someone else can pick it up and continue where I left off, if that is needed or desired). In its current form, the PR should satisfy AC1. AC2 will need some additional automation on the newly created VMs (the magic phrase to search for is "remote-exec Provisioner").
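
As an illustration of the remote-exec approach mentioned above, a minimal and untested sketch could look like the following. The resource name, the worker_ami variable, the instance type, the key pair name, the SSH user, the key path and the shell commands are all assumptions made for this example and are not taken from the pull request.

```hcl
# Sketch only: every name and value below is a placeholder.
variable "worker_ami" {
  type        = string
  description = "Image ID for the worker VM (hypothetical variable)"
}

resource "aws_instance" "openqa_worker" {
  ami           = var.worker_ami
  instance_type = "t3.large"        # assumed size
  key_name      = "openqa-deployer" # assumes an EC2 key pair of this name exists

  # How Terraform reaches the instance for provisioning
  connection {
    type        = "ssh"
    host        = self.public_ip
    user        = "ec2-user"                          # depends on the image
    private_key = file(pathexpand("~/.ssh/id_rsa"))   # hypothetical key path
  }

  # Commands executed on the new VM once SSH is reachable
  provisioner "remote-exec" {
    inline = [
      "sudo zypper --non-interactive install openQA-worker",
      "sudo systemctl enable --now openqa-worker@1",
    ]
  }
}
```

The provisioner only runs when the resource is created, and it needs working SSH access to the instance. Worker configuration (for example which web UI host to connect to) is not covered by this sketch.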

Actions #6

Updated by livdywan about 2 years ago

Let's look into it during the mob session on Thursday

Actions #7

Updated by livdywan about 2 years ago

jbaier_cz wrote:

Unfortunately, my AWS credentials seem to be no longer valid, so I am unable to test that for real. Code written for another provider would not be shared with the AWS code either, so that is not a way out.

I'm looking into access to AWS following internal documentation about landing zone access. Unfortunately it seems all of our accounts need to be replaced. See SD-103992.

Actions #8

Updated by livdywan about 2 years ago

Resources

Run terraform with fake aws locally

    cd container/terraform

    podman run --rm -it --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light validate
    podman run --rm -it --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light init ## this needs to be run once; providers will be downloaded to a local folder
    podman run --rm -it --name localstack -p 4566:4566 -p 4510-4559:4510-4559 -v $(pwd):/workspace -w /workspace localstack/localstack:latest
    podman run --rm -it --network host --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light apply
    ╷
    │ Warning: Argument is deprecated
    │
    │   with provider["registry.terraform.io/hashicorp/aws"],
    │   on main.tf line 18, in provider "aws":
    │   18:   s3_force_path_style         = false
    │
    │ Use s3_use_path_style instead.
    │
    │ (and one more similar warning elsewhere)
    ╵
    ╷
    │ Warning: Attribute Deprecated
    │
    │   with provider["registry.terraform.io/hashicorp/aws"],
    │   on main.tf line 18, in provider "aws":
    │   18:   s3_force_path_style         = false
    │
    │ Use s3_use_path_style instead.
    │
    │ (and one more similar warning elsewhere)
    ╵

Interim verdict / next steps

  • Without the "pro" version we get a sanity check of the AWS setup, but it won't set up working containers
  • Reproduce the above commands in the form of a GitHub action @cdywan
  • Get new AWS accounts (see SD ticket above) @cdywan
  • Confirm that this can be deployed on the actual AWS - to be done in another mob session
  • We saw some deprecation warnings, those should be investigated @tina
Actions #9

Updated by livdywan about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #10

Updated by tinita about 2 years ago

Simply replacing

  s3_force_path_style         = true

with

 s3_use_path_style            = true

gets rid of the deprecation warning.

The docs at https://docs.localstack.cloud/integrations/terraform/ seem to be a bit outdated.

Actions #11

Updated by openqa_review about 2 years ago

  • Due date set to 2022-11-25

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by livdywan about 2 years ago

variable "aws_access_key_id" { default = "test" }
variable "aws_secret_access_key" { default = "test" }
variable "aws_session_token" { default = "test" }

provider "aws" {
  region                      = var.region
  access_key                  = var.aws_access_key_id
  secret_key                  = var.aws_secret_access_key
  token                       = var.aws_session_token
  s3_use_path_style           = true
}
  • Place credentials as key-value pairs in container/terraform/terraform.tfvars
  • Consider deleting old state when terraform gets confused: rm terraform.tfstate*
  • export TF_LOG="TRACE" doesn't seem to do much
  • We tried to provision SSH keys via Terraform using resource "aws_key_pair" "deployer" { public_key = "ssh-rsa ....... user@suse.de" } (see also https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/key_pair) and also via user-data in the aws_instance sections. Neither could be confirmed to work.
  • The web UI doesn't correctly spin up; without SSH access we had no way of investigating the problem.
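
For reference, a sketch of how a key pair, an instance and the network rules for SSH and the web UI are commonly wired together with the AWS provider. This is not a diagnosis of the problem described above, and it is untested against this setup. The variable names, resource names, instance type and the wide-open CIDR ranges are assumptions for illustration.

```hcl
# Sketch only: names and values are placeholders.
variable "ssh_public_key" {
  type        = string
  description = "Public key to install (hypothetical variable)"
}

variable "image_id" {
  type        = string
  description = "Image ID for the web UI VM (hypothetical variable)"
}

resource "aws_key_pair" "deployer" {
  key_name   = "openqa-deployer"
  public_key = var.ssh_public_key
}

resource "aws_security_group" "openqa" {
  name = "openqa-access"

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # open to everyone for illustration only
  }

  ingress {
    description = "Web UI over HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # all protocols
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "openqa_webui" {
  ami                    = var.image_id
  instance_type          = "t3.large" # assumed size
  key_name               = aws_key_pair.deployer.key_name
  vpc_security_group_ids = [aws_security_group.openqa.id]
}
```

The key pair only takes effect on an instance that references it through key_name, and inbound traffic is only accepted on ports that an attached security group allows.
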
Actions #13

Updated by livdywan about 2 years ago

I pushed an update to the PR that adds terraform and CI which relies on overrides and variables and works for me locally.

Still need to experiment further with SSH key deployment.

Actions #14

Updated by livdywan about 2 years ago

cdywan wrote:

I pushed an update to the PR that adds terraform and CI which relies on overrides and variables and works for me locally.

I browsed stackoverflow and read up on workspaces and dynamic blocks which allowed me to cleanly use the same configuration locally and in CI unmodified.
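
A rough sketch of that pattern, assuming a workspace named "local" is used for LocalStack runs and that LocalStack listens on its default port 4566. The workspace name, the region and the list of endpoints are assumptions, and the actual configuration in the PR may differ.

```hcl
locals {
  # Assumption: LocalStack runs happen in a workspace called "local"
  use_localstack = terraform.workspace == "local"
}

provider "aws" {
  region = "us-east-1" # placeholder region

  # Only skip the AWS-side checks when talking to LocalStack
  skip_credentials_validation = local.use_localstack
  skip_metadata_api_check     = local.use_localstack
  skip_requesting_account_id  = local.use_localstack

  # The endpoints block is emitted once for LocalStack and not at all otherwise
  dynamic "endpoints" {
    for_each = local.use_localstack ? [1] : []
    content {
      ec2 = "http://localhost:4566"
      sts = "http://localhost:4566"
    }
  }
}
```

With this layout, terraform workspace new local (or terraform workspace select local) would switch to the LocalStack endpoints, while any other workspace would use the regular AWS endpoints with the same files.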

aws_access_key_id=
aws_secret_access_key=
aws_session_token=

These can now be put into a file terraform.tfvars, as mentioned in a comment in variables.tf. Note that the credentials are invalidated automatically: between sessions I had to replace all of them to test on AWS.
Also, the image ID expired in the meantime, so I'm not trying to keep it "correct" at this point. Maybe it needs to be filled in whenever it's used in production.
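
One possible way to avoid hard-coding an image ID that expires is to look it up at plan time with the aws_ami data source. The sketch below is untested here. The owner account and the name pattern are deliberately left as variables because the ticket does not state which image is used.

```hcl
variable "ami_owner" {
  type        = string
  description = "AWS account ID that publishes the image (hypothetical, must be supplied)"
}

variable "ami_name_pattern" {
  type        = string
  description = "Glob matching the image name (hypothetical, must be supplied)"
}

# Pick the newest image from the given owner whose name matches the pattern
data "aws_ami" "base" {
  most_recent = true
  owners      = [var.ami_owner]

  filter {
    name   = "name"
    values = [var.ami_name_pattern]
  }
}

output "resolved_image_id" {
  value = data.aws_ami.base.id
}
```

An instance could then use ami = data.aws_ami.base.id instead of a fixed ID.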

Still need to experiment further with SSH key deployment.

Still no real progress there. I tried some things to get something to run but I can only guess why it won't since I'm still flying blind.

Actions #15

Updated by livdywan about 2 years ago

  • Due date deleted (2022-11-25)
  • Status changed from In Progress to Workable

cdywan wrote:

Still need to experiment further with SSH key deployment.

Still no real progress there. I tried some things to get something to run but I can only guess why it won't since I'm still flying blind.

Maybe somebody else would like to give it a go. I don't see a way to split up or refine the AC, but I simply can't spot the problem with exposing SSH or web UI services.

Actions #16

Updated by okurz about 2 years ago

  • Subject changed from Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M to Basic terraform recipe to replace OSD w/ workers (in the cloud)
  • Status changed from Workable to New
  • Assignee deleted (livdywan)

Let's re-evaluate.

Actions #17

Updated by robert.richardson about 2 years ago

  • Status changed from New to Blocked

blocked due to ssh not working #121222

Actions #18

Updated by tinita about 2 years ago

  • Status changed from Blocked to New

#121222 resolved, so not blocked anymore

Actions #19

Updated by okurz almost 2 years ago

  • Subject changed from Basic terraform recipe to replace OSD w/ workers (in the cloud) to Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #20

Updated by okurz almost 2 years ago

  • Project changed from 46 to openQA Infrastructure (public)
Actions #21

Updated by okurz almost 2 years ago

  • Project changed from openQA Infrastructure (public) to openQA Project (public)
Actions #22

Updated by livdywan over 1 year ago

  • AC2: One or multiple workers are setup

With this we can review/merge the PR.

Follow up steps:

Actions #23

Updated by livdywan over 1 year ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan
Actions #24

Updated by openqa_review over 1 year ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #25

Updated by openqa_review over 1 year ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #26

Updated by openqa_review over 1 year ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #27

Updated by openqa_review over 1 year ago

Setting due date based on mean cycle time of SUSE QE Tools

Actions #28

Updated by okurz over 1 year ago

  • Category set to Feature requests
  • Status changed from In Progress to New
  • Assignee deleted (livdywan)
  • Target version changed from Ready to future

I need to remove this from the backlog, and I assume you mentioned this one as a candidate that you would unassign anyway, didn't you?

Actions #29

Updated by okurz over 1 year ago

  • Parent task changed from #98472 to #127040