action #98415 (closed)

Deploy container private registry for openQA tests

Added by jlausuch almost 3 years ago. Updated 7 months ago.

Status: Rejected
Priority: Low
Assignee: -
Target version: -
Start date: 2021-09-09
Due date: -
% Done: 0%
Estimated time: -

Description

Currently, we have an unsecured registry deployed in a VM in the Amazon cloud.
The setup is very basic and just follows the steps in this confluence page.

As an openQA test developer, I would like to have a more reliable and secure private container registry to be used by the container tests running on:

  • openqa.suse.de
  • openqa.opensuse.org

The registry should contain (at least) these images:

  • hello-world
  • alpine:latest
  • centos:latest
  • fedora:latest
  • debian:latest

One possible implementation is Harbor, an open source registry with some interesting features.
Another option is to use the managed container registries offered by the public cloud providers (e.g. Amazon ECR, Google Container Registry, etc.).

The proposed implementation should include some cost planning and reasoning (e.g. we would need to justify choosing a managed registry service, as it might be more expensive than running a VM in the cloud with Harbor).
Additional implementation ideas are welcome.
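
For illustration, mirroring the listed images into such a registry could look roughly like this (a minimal sketch; the registry host is a hypothetical placeholder):

REGISTRY=registry.example.test:5000   # hypothetical registry host
for img in hello-world alpine centos fedora debian; do
    docker pull "docker.io/library/$img:latest"                         # fetch from Docker Hub
    docker tag "docker.io/library/$img:latest" "$REGISTRY/$img:latest"  # retag for our registry
    docker push "$REGISTRY/$img:latest"                                 # publish to our registry
done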


Related issues (3: 0 open, 3 closed)

  • Related to Containers - action #126947: Create Ansible configuration for registry.qe.suse.de (Rejected, rbranco, 2023-03-30)
  • Related to Containers - action #130222: Update host and refresh container images we have in our registry (Resolved, 2023-06-01)
  • Related to Containers - action #130237: [investigation] Possibility to host some container images in registry.opensuse.org (Rejected, rbranco, 2023-06-01)
Actions #1

Updated by jlausuch almost 3 years ago

  • Subject changed from Improve private registry for openQA tests to Deploy container private registry for openQA tests
Actions #2

Updated by jlausuch almost 3 years ago

  • Priority changed from Normal to Low
Actions #3

Updated by jlausuch over 2 years ago

  • Status changed from New to Workable
Actions #4

Updated by jlausuch about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to jlausuch
  • Priority changed from Low to Normal
Actions #5

Updated by jlausuch about 2 years ago

I set up a registry using Amazon ECR instead of creating a new VM in the new Amazon account.
I have documented how to do it here: https://confluence.suse.com/pages/viewpage.action?pageId=1016692995

This is the PR to switch to that new registry: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/15034
We need to remove "library" from the pull command (see the example below).
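
For context, on Docker Hub the default namespace library/ is implicit, but a private registry has no such namespace, so the image path changes (the private host below is a placeholder):

# Docker Hub resolves bare image names to the "library" namespace:
docker pull docker.io/library/alpine:latest
# On the private registry the image sits at the top level (hypothetical host):
docker pull registry.example.test:5000/alpine:latest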

Actions #6

Updated by jlausuch about 2 years ago

While doing verification runs, I bumped into an issue:

# docker image pull public.ecr.aws/r6r9k8r3/hello-world; echo isnlI-$?-
Using default tag: latest
latest: Pulling from r6r9k8r3/hello-world
toomanyrequests: Rate exceeded

Example: https://openqa.suse.de/tests/8929837#step/docker/172

This only happens when using docker, never with podman. Not sure why; I asked about this on Stack Overflow.

There is some information about this here:
https://docs.aws.amazon.com/AmazonECR/latest/public/public-service-quotas.html
Apparently, there is a hard limit of 1 on the rate of unauthenticated image pulls (not sure what that 1 refers to).

I am trying to work around this... apparently, retrying works.
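
For reference, the retry could be as simple as this sketch (illustrative only, not the actual test code):

for attempt in 1 2 3 4 5; do
    docker image pull public.ecr.aws/r6r9k8r3/hello-world && break   # stop on success
    echo "pull failed (attempt $attempt), retrying..." >&2
    sleep $((attempt * 5))   # simple linear backoff between attempts
done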

Actions #7

Updated by jlausuch about 2 years ago

Meanwhile, I realized there is an AWS public registry, the Amazon ECR Public Gallery:
https://gallery.ecr.aws/

It contains images from trusted sources (official accounts that also push to docker.io), and it also contains images from public registries created by individuals, like the one I created in our AWS account...

The good thing is that we can use the trusted sources:
For example, for Alpine we have
https://gallery.ecr.aws/docker/library/alpine
Same for other images (ubuntu, centos, debian, etc.).
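
Pulling from one of those trusted-source repositories looks like this:

podman pull public.ecr.aws/docker/library/alpine:latest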

The toomanyrequests: Rate exceeded problem when using docker still exists, but I am trying some code changes so that it does not break our tests.

Actions #9

Updated by jlausuch about 2 years ago

  • Status changed from In Progress to Workable
  • Priority changed from Normal to Low

After some investigation, using our own ECR registry or the public gallery has the same limitations. I ran sets of 500 tests to see how often this "rate exceeded" happens, and it is not just sporadic: roughly 20% of the test cases were failing with this limitation.

So, what I did in the end was to create another VM in the new AWS account, following https://confluence.suse.com/display/qasle/Building+a+private+registry

Then, I updated the IP of that new registry in all the jobs and test suites in OSD and O3:
https://gitlab.suse.de/qac/qac-openqa-yaml/-/merge_requests/831
https://github.com/os-autoinst/opensuse-jobgroups/pull/156
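
Schematically, these changes just swap the registry host in the job group settings; something like the following (the variable name and value are illustrative placeholders, not the actual diff):

settings:
  REGISTRY: <new-registry-ip>:5000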

So, the situation of this ticket is unchanged: we still have an insecure registry in a VM, with all the drawbacks that this approach has. The ticket remains open for a better implementation.

Actions #10

Updated by jlausuch almost 2 years ago

  • Assignee deleted (jlausuch)
Actions #11

Updated by ilausuch over 1 year ago

  • Assignee set to ilausuch
Actions #12

Updated by ilausuch over 1 year ago

  • Assignee deleted (ilausuch)
Actions #14

Updated by jlausuch over 1 year ago

rbranco wrote:

What about using the GitHub Container Registry?

https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry

Could you please further investigate this? It sounds very interesting.
Basically, we need a place to store some test images that don't belong to us (alpine, ubuntu, fedora, etc.). Those images will be pulled by the tests, so if the registry is public, we don't care, as we are not exposing any internal image.
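
For the record, publishing a test image to ghcr.io is straightforward (a sketch; USER/ORG are placeholders, and the token needs the write:packages scope):

echo "$GITHUB_TOKEN" | docker login ghcr.io -u USER --password-stdin  # authenticate with a PAT
docker tag alpine:latest ghcr.io/ORG/alpine:latest                    # retag under the org namespace
docker push ghcr.io/ORG/alpine:latest                                 # publish the package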

Actions #15

Updated by rbranco over 1 year ago

  • Assignee set to rbranco
Actions #16

Updated by rbranco over 1 year ago

  • Status changed from Workable to In Progress
Actions #17

Updated by rbranco over 1 year ago

jlausuch wrote:

So, what I did in the end was to create another VM in the new AWS account, following https://confluence.suse.com/display/qasle/Building+a+private+registry

I modified the document to add -e REGISTRY_STORAGE_DELETE_ENABLED=true -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io to the docker run command. The first variable enables the scheduler to garbage-collect old content, according to https://docs.docker.com/registry/recipes/mirror/
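
For reference, the resulting invocation would look roughly like this (a sketch assuming the stock registry:2 image; names and paths are illustrative):

docker run -d --name registry --restart=always -p 5000:5000 \
    -v /var/lib/registry:/var/lib/registry \
    -e REGISTRY_STORAGE_DELETE_ENABLED=true \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    registry:2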

I found one problem with the pull-through cache approach described in the document: the server is easily DoS'able by a malicious actor filling up the disk by pulling irrelevant images. This can be mitigated with Nginx as a reverse proxy and an IP whitelist, though.

I even managed to render it unable to cache anything just by pulling busybox and then running my registry listing tool: the cache proxied the requests for listing the busybox repository, downloading the manifests for all tags, and eventually hit the TOOMANYREQUESTS error: "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit".

For this reason I believe that a vanilla Registry server with manually pushed images is better than a pull-through cache.

So, the situation of this ticket is unchanged: we still have an insecure registry in a VM, with all the drawbacks that this approach has. The ticket remains open for a better implementation.

If we're serving base images, I don't see the issue with having an "insecure" (it's really just plain HTTP) registry, as the images are verified by hash. Ubuntu repositories are plain HTTP with hashes signed by a GPG key.
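
For completeness, clients do need to be told to accept the plain-HTTP endpoint; a sketch of the usual configuration (the hostname is a placeholder):

# docker: /etc/docker/daemon.json
{
    "insecure-registries": ["registry.example.test:5000"]
}

# podman: /etc/containers/registries.conf
[[registry]]
location = "registry.example.test:5000"
insecure = true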

Actions #18

Updated by jlausuch over 1 year ago

rbranco wrote:

jlausuch wrote:

So, what I did in the end was to create another VM in the new AWS account, following https://confluence.suse.com/display/qasle/Building+a+private+registry

I modified the document to add -e REGISTRY_STORAGE_DELETE_ENABLED=true -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io to the docker run command. The first variable enables the scheduler to garbage-collect old content, according to https://docs.docker.com/registry/recipes/mirror/

I found one problem with the pull-through cache approach described in the document: the server is easily DoS'able by a malicious actor filling up the disk by pulling irrelevant images. This can be mitigated with Nginx as a reverse proxy and an IP whitelist, though.

I even managed to render it unable to cache anything just by pulling busybox and then running my registry listing tool: the cache proxied the requests for listing the busybox repository, downloading the manifests for all tags, and eventually hit the TOOMANYREQUESTS error: "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit".

For this reason I believe that a vanilla Registry server with manually pushed images is better than a pull-through cache.

So, the situation of this ticket is unchanged: we still have an insecure registry in a VM, with all the drawbacks that this approach has. The ticket remains open for a better implementation.

If we're serving base images, I don't see the issue with having an "insecure" (it's really just plain HTTP) registry, as the images are verified by hash. Ubuntu repositories are plain HTTP with hashes signed by a GPG key.

OK, insecure should be fine for this case...
What about asking for a VM in our infra with access from OSD and O3?

Actions #19

Updated by rbranco over 1 year ago

Ticket opened by @jlausuch for a VM: https://sd.suse.com/servicedesk/customer/portal/1/SD-113323

The garbage collector has a bug when dealing with multi-arch manifests, so we should do manual cleanup or use GitLab:

https://github.com/distribution/distribution/issues/3178
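
For reference, the manual cleanup amounts to running the collector inside the container (a sketch; the container name and config path assume the stock registry:2 image, and the --delete-untagged option is reportedly the one affected by the linked bug):

docker exec registry /bin/registry garbage-collect /etc/docker/registry/config.yml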

Actions #21

Updated by jlausuch about 1 year ago

  • Related to action #126947: Create Ansible configuration for registry.qe.suse.de added
Actions #22

Updated by jlausuch about 1 year ago

Should we resolve this one and continue in #126947?

Actions #23

Updated by jlausuch about 1 year ago

  • Status changed from In Progress to Blocked

We need to fix infra issues first. Some workers can't access this registry yet.

Actions #24

Updated by jlausuch about 1 year ago

  • Related to action #130222: Update host and refresh container images we have in our registry added
Actions #25

Updated by jlausuch about 1 year ago

  • Related to action #130237: [investigation] Possibility to host some container images in registry.opensuse.org added
Actions #26

Updated by rbranco about 1 year ago

  • Status changed from Blocked to In Progress
Actions #27

Updated by rbranco about 1 year ago

An approach to (ab)use the GitHub Container Registry: https://github.com/ricardobranco777/images/pull/1

Actions #28

Updated by ph03nix about 1 year ago

Solution suggestion:

  • Use k2.qe.suse.de as container registry for OSD and OOO. This machine would reside within the RD network of SUSE and would not be reachable from the outside
  • For PublicCloud test runs, keep using our registry server on AWS (3.71.98.16)

For the first point we would need to ask the infra team to open port 5000.
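
Opening the port would be a one-liner for the infra team, e.g. with firewalld (assuming that is what the host runs):

firewall-cmd --permanent --add-port=5000/tcp   # allow the registry port
firewall-cmd --reload                          # apply the change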

Actions #29

Updated by rbranco about 1 year ago

  • Status changed from In Progress to Feedback
Actions #30

Updated by jlausuch 12 months ago

ph03nix wrote:

Solution suggestion:

  • Use k2.qe.suse.de as container registry for OSD and OOO. This machine would reside within the RD network of SUSE and would not be reachable from the outside
  • For PublicCloud test runs, keep using our registry server on AWS (3.71.98.16)

For the first point we would need to ask the infra team to open port 5000.

I am inclined to disagree. I don't see the point in maintaining two registry instances; it's even worse than the current situation, as it duplicates the work... I would just keep the AWS one for everything.
Let Ricardo investigate the openSUSE registry solution. If it turns out to be too much of a burden for us, then let's keep only the AWS machine and improve it.

Actions #31

Updated by rbranco 12 months ago

  • Status changed from Feedback to Blocked
Actions #32

Updated by ph03nix 11 months ago

Let Ricardo investigate the openSUSE registry solution. If it turns out to be too much of a burden for us, then let's keep only the AWS machine and improve it.

Ricardo, can you give us a status update on this one? I renewed the AWS machine, so we can just use that one if necessary or convenient.

Actions #33

Updated by rbranco 11 months ago

ph03nix wrote:

Let Ricardo investigate the openSUSE registry solution. If it turns out to be too much of a burden for us, then let's keep only the AWS machine and improve it.

Ricardo, can you give us a status update on this one? I renewed the AWS machine, so we can just use that one if necessary or convenient.

We can't use registry.opensuse.org, and the issues we have with setting up our own registry with access from o.s.d and o3 can't be solved by IT alone.

So we have to stick with that VM.

Actions #34

Updated by rbranco 7 months ago

  • Status changed from Blocked to Rejected