action #118660
opencoordination #127040: [epic] Scale out: Easier and automated disaster recovery deployments of openQA
Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M
Added by livdywan about 2 years ago. Updated over 1 year ago.
100%
Description
Motivation
We investigated the general feasibility of running an OSD clone in the cloud in #88341, and #100581 is about documenting a setup for workers. Using Terraform would give us a more efficient, generic way to set up OSD (in several ways depending on backends) without relying on clicking around in e.g. AWS web interfaces.
Acceptance criteria
- AC1: A main.tf exists that allows setting up OSD in the cloud
- AC2: One or multiple workers are set up
Suggestions
- Look at https://learn.hashicorp.com/terraform
- Read https://www.terraform.io/docs
- See e.g. Docker setup of Terraform https://learn.hashicorp.com/tutorials/terraform/docker-build?in=terraform/docker-get-started
- Look for training videos at O'Reilly
- Use the terraform recipes from https://github.com/os-autoinst/openQA/pull/4880 to set up your own instance with terraform to see how openQA bootstrap would be called
- Create a script or documentation describing how, based on that, one would set up a single worker or multiple workers connecting to another instance
Out of scope
- It is ok if the terraform recipes provide a good baseline for most of the work and the rest is done manually or with quick temporary changes to the terraform recipes as desired, so not everything needs to be automated. If you e.g. use the AWS web UI to add final touches yourself but have that noted down, that is completely fine
Updated by jbaier_cz about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to jbaier_cz
I might be interested in looking into this. There is supposed to be a good book about the topic: Terraform Cookbook from Dec 2023 (not a typo).
Updated by openqa_review about 2 years ago
- Due date set to 2022-11-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by jbaier_cz about 2 years ago
So far I started with https://github.com/os-autoinst/openQA/pull/4880, which is a simple draft created according to https://progress.opensuse.org/projects/openqav3/wiki/Wiki#section-71. As next steps, I would like to see if I can use another terraform provider (like docker) to test it locally and/or try to log in to the AWS console and test/debug it there.
Updated by jbaier_cz about 2 years ago
- Due date deleted (2022-11-08)
- Status changed from In Progress to Workable
- Assignee deleted (jbaier_cz)
I updated my pull request to include more settings (inspired by the terraform script created as part of os-autoinst-distri-opensuse). Unfortunately, my AWS credentials seem to be no longer valid, so I am unable to test that for real. Also, code written for another provider would not share code with the AWS recipe, so that is not a way out either. There is, however, an alternative approach we can investigate: the terraform-local and localstack projects should be able to mimic the AWS API locally via docker, and it should also be feasible to run this inside GitHub Actions to test the terraform code without actually involving AWS.
Due to my upcoming vacation I am unable to finish this ticket in the foreseeable future, so I will unassign myself and set it back to Workable (someone else can pick up and continue where I ended, if that is needed or desired). In its current form, the PR should satisfy AC1. AC2 will need some additional automation on the newly created VMs (the magic phrase to search for is "remote-exec Provisioner").
Updated by livdywan about 2 years ago
Let's look into it during the mob session on Thursday
Updated by livdywan about 2 years ago
jbaier_cz wrote:
Unfortunately, my AWS credentials seem to be no longer valid, so I am unable to test that for real. Also, code written for another provider would not share code with the AWS recipe, so that is not a way out either.
I'm looking into access to AWS following internal documentation about landing zone access. Unfortunately it seems all of our accounts need to be replaced. See SD-103992.
Updated by livdywan about 2 years ago
Resources
- https://docs.localstack.cloud/integrations/terraform/
- https://www.scien.cx/2021/07/03/localstack-with-terraform-and-docker-for-running-aws-locally/
- https://docs.localstack.cloud/get-started/#localstack-cockpit GUI for debugging (available as an AppImage, requires docker)
- https://docs.localstack.cloud/ci/github-actions/
Run terraform with fake aws locally
cd container/terraform
podman run --rm -it --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light validate
podman run --rm -it --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light init ## this needs to be run once; providers will be downloaded to a local folder
podman run --rm -it --name localstack -p 4566:4566 -p 4510-4559:4510-4559 -v $(pwd):/workspace -w /workspace localstack/localstack:latest
podman run --rm -it --network host --name terraform -v $(pwd):/workspace -w /workspace hashicorp/terraform:light apply
╷
│ Warning: Argument is deprecated
│
│ with provider["registry.terraform.io/hashicorp/aws"],
│ on main.tf line 18, in provider "aws":
│ 18: s3_force_path_style = false
│
│ Use s3_use_path_style instead.
│
│ (and one more similar warning elsewhere)
╵
╷
│ Warning: Attribute Deprecated
│
│ with provider["registry.terraform.io/hashicorp/aws"],
│ on main.tf line 18, in provider "aws":
│ 18: s3_force_path_style = false
│
│ Use s3_use_path_style instead.
│
│ (and one more similar warning elsewhere)
╵
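The terraform invocations above share almost all of their options. A minimal sketch of a shell wrapper, assuming podman and the same image tag as above; the function name tf is hypothetical and not part of the PR, and --network host is passed for every subcommand although only apply needs it to reach localstack:

```shell
#!/bin/sh
# Hypothetical helper: run the Terraform container with the options used above.
# All arguments are passed through to the terraform binary in the image.
tf() {
    podman run --rm -it --network host --name terraform \
        -v "$(pwd)":/workspace -w /workspace \
        hashicorp/terraform:light "$@"
}
```

With this, the sequence becomes tf init, tf validate and, once the localstack container is running in another terminal, tf apply. Note that terraform validate expects an initialized working directory, so running init first is the safer order.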
Interim verdict / next steps
- Without the "pro" version we get a sanity check of the AWS setup but it won't set up working containers
- Reproduce the above commands in the form of a GitHub action @cdywan
- Get new AWS accounts (see SD ticket above) @cdywan
- Confirm that this can be deployed on the actual AWS - to be done in another mob session
- We saw some deprecation warnings, those should be investigated @tina
Updated by livdywan about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Updated by tinita about 2 years ago
Simply replacing
s3_force_path_style = true
with
s3_use_path_style = true
gets rid of the deprecation warning.
The docs at https://docs.localstack.cloud/integrations/terraform/ seem to be a bit outdated.
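The rename can also be scripted, e.g. for a CI step. A minimal sketch, assuming the argument lives in main.tf in the current directory; the function name rename_path_style is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: rename the deprecated provider argument in a Terraform file.
# The file defaults to main.tf in the current directory (an assumption).
rename_path_style() {
    file="${1:-main.tf}"
    # Write to a temporary file first; in-place editing with sed -i is not portable
    sed 's/s3_force_path_style/s3_use_path_style/' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

Calling rename_path_style without arguments rewrites main.tf; the value after the equals sign is left untouched.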
Updated by openqa_review about 2 years ago
- Due date set to 2022-11-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan about 2 years ago
- For CI (https://github.com/os-autoinst/openQA/pull/4880) override files could be used to avoid hard-coding localstack values https://developer.hashicorp.com/terraform/language/files/override
- Add ~/.aws/credentials without relying on aws configure
  aws_access_key_id=
  aws_secret_access_key=
  aws_session_token=
- Verify what's running via
  aws ec2 describe-instances | jq '.Reservations | .[] | .Instances | .[] | .ImageId,.InstanceId,.PublicDnsName'
  analogous to checking the AWS console
variable "aws_access_key_id" { default = "test" }
variable "aws_secret_access_key" { default = "test" }
variable "aws_session_token" { default = "test" }
provider "aws" {
  region = var.region
  access_key = var.aws_access_key_id
  secret_key = var.aws_secret_access_key
  token = var.aws_session_token
  s3_use_path_style = true
}
- Place credentials as key-value pairs in container/terraform/terraform.tfvars
- Consider deleting old state when terraform gets confused:
  rm terraform.tfstate*
- export TF_LOG="TRACE" doesn't seem to do much
- We tried to provision SSH keys via Terraform using
  resource "aws_key_pair" "deployer" { public_key = "ssh-rsa ....... user@suse.de" }
  (see also https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/key_pair) and also via user-data in aws-instance sections. Neither could be confirmed to work.
- The web UI doesn't correctly spin up; without SSH access we had no way of investigating the problem.
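Both credential locations mentioned in this note can be filled from a script instead of by hand. A minimal sketch, assuming the values are available in the standard AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables; the function names are hypothetical, the [default] profile header is what the AWS CLI expects in the credentials file, and existing files are overwritten:

```shell
#!/bin/sh
# Hypothetical helpers: write temporary AWS credentials for the AWS CLI
# and for Terraform. Values are read from the AWS_* environment variables
# (an assumption); target paths can be overridden via the first argument.

write_aws_credentials() {
    target="${1:-$HOME/.aws/credentials}"
    mkdir -p "$(dirname "$target")"
    # umask 077 in a subshell: a newly created file is only readable by its owner
    (umask 077
    cat > "$target" <<EOF
[default]
aws_access_key_id=${AWS_ACCESS_KEY_ID}
aws_secret_access_key=${AWS_SECRET_ACCESS_KEY}
aws_session_token=${AWS_SESSION_TOKEN}
EOF
    )
}

write_tfvars() {
    target="${1:-container/terraform/terraform.tfvars}"
    # Variable names match the variable blocks shown above
    (umask 077
    cat > "$target" <<EOF
aws_access_key_id     = "${AWS_ACCESS_KEY_ID}"
aws_secret_access_key = "${AWS_SECRET_ACCESS_KEY}"
aws_session_token     = "${AWS_SESSION_TOKEN}"
EOF
    )
}
```

Since the session credentials are invalidated regularly, re-running both functions after exporting fresh values is likely easier than editing the files manually.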
Updated by livdywan about 2 years ago
I pushed an update to the PR that adds terraform and CI which relies on overrides and variables and works for me locally.
Still need to experiment further with SSH key deployment.
Updated by livdywan almost 2 years ago
cdywan wrote:
I pushed an update to the PR that adds terraform and CI which relies on overrides and variables and works for me locally.
I browsed Stack Overflow and read up on workspaces and dynamic blocks, which allowed me to cleanly use the same configuration locally and in CI unmodified.
aws_access_key_id= aws_secret_access_key= aws_session_token=
These can now be put into a file terraform.tfvars, as mentioned in a comment in variables.tf. Note that these are invalidated automatically; between sessions I had to replace all of them to test on AWS.
Also, the image ID expired in the meantime, so I'm not trying to keep it "correct" at this point. Maybe it needs to be filled in whenever it's used in production.
Still need to experiment further with SSH key deployment.
Still no real progress there. I tried some things to get something to run but I can only guess why it won't since I'm still flying blind.
Updated by livdywan almost 2 years ago
- Due date deleted (2022-11-25)
- Status changed from In Progress to Workable
cdywan wrote:
Still need to experiment further with SSH key deployment.
Still no real progress there. I tried some things to get something to run but I can only guess why it won't since I'm still flying blind.
Maybe somebody else would like to give it a go. I don't see a way to split up or refine the AC, but I simply can't spot the problem with exposing SSH or web UI services.
Updated by okurz almost 2 years ago
- Subject changed from Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M to Basic terraform recipe to replace OSD w/ workers (in the cloud)
- Status changed from Workable to New
- Assignee deleted (livdywan)
Let's re-evaluate.
Updated by robert.richardson almost 2 years ago
- Status changed from New to Blocked
blocked due to ssh not working #121222
Updated by tinita almost 2 years ago
- Status changed from Blocked to New
#121222 resolved, so not blocked anymore
Updated by okurz almost 2 years ago
- Subject changed from Basic terraform recipe to replace OSD w/ workers (in the cloud) to Basic terraform recipe to replace OSD w/ workers (in the cloud) size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 2 years ago
- Project changed from 46 to openQA Infrastructure
Updated by okurz almost 2 years ago
- Project changed from openQA Infrastructure to openQA Project
Updated by livdywan over 1 year ago
- AC2: One or multiple workers are set up
- We should add another aws_instance for a worker
- Workers can probably be "internal" to the network
- Add steps from #118660#note-14 to the README.md in a Terraform
- For a recent openSUSE Leap image check https://pint.suse.com/?resource=images&csp=amazon&state=active&region=eu-central-1&search=leap
- Suggest specifying image_id
With this we can review/merge the PR.
Follow up steps:
- Allow multiple workers e.g. something like https://stackoverflow.com/questions/60716579/terraform-creating-multiple-instances-with-for-each
- Workers need to talk to the web UI
- Developer mode needs to be able to connect to the worker if we want to support it
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
cdywan wrote:
- Add steps from #118660#note-14 to the README.md in a Terraform
- For a recent openSUSE Leap image check https://pint.suse.com/?resource=images&csp=amazon&state=active&region=eu-central-1&search=leap
- Suggest specifying image_id
Updated by openqa_review over 1 year ago
Setting due date based on mean cycle time of SUSE QE Tools
Updated by openqa_review over 1 year ago
Setting due date based on mean cycle time of SUSE QE Tools
Updated by openqa_review over 1 year ago
Setting due date based on mean cycle time of SUSE QE Tools
Updated by openqa_review over 1 year ago
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 1 year ago
- Category set to Feature requests
- Status changed from In Progress to New
- Assignee deleted (livdywan)
- Target version changed from Ready to future
I need to remove this from the backlog, and I assume you mentioned this one as a candidate that you would unassign anyway, didn't you?