action #67855: [tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org - openQA Project (public) - openSUSE Project Management Tool

action #67855

open

[tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org

Added by okurz almost 5 years ago. Updated over 4 years ago.

Status:

New

Priority:

Low

Assignee:

Category:

Regressions/Crashes

Target version:

QA (public) - future

Start date:

2020-06-09

Due date:

% Done:

Estimated time:

Description

Observation¶

I have often seen output similar to the following in the past weeks. Latest example: https://app.circleci.com/pipelines/github/os-autoinst/openQA/3231/workflows/17a0a4f7-8fe9-47e3-a50d-191a87c75cc8/jobs/30830/steps:

Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
81e5221d78cb: Pulling fs layer
3e86cb3bb3f5: Pulling fs layer
81e5221d78cb: Verifying Checksum
81e5221d78cb: Download complete
81e5221d78cb: Pull complete
  error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest

Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest

So while it looks like the layers of the container are being downloaded, after 10m the job still fails with "error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest". Probably the base layer 81e5221d78cb is present in the cache but 3e86cb3bb3f5 fails to load. Locally I can see that this blob is 492 MiB, so maybe it is just a bit too big.
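
As a rough plausibility check of the "too big" theory: if the roughly 10 minutes observed above act as a hard limit for the pull (an assumption, the actual circleCI limit is not stated here), the 492 MiB blob needs a sustained rate of about 0.82 MiB/s. A minimal Python sketch of that arithmetic, with a function name made up for illustration:

```python
def required_rate_mib_s(size_mib: float, limit_s: float) -> float:
    """Minimum sustained download rate (MiB/s) to fetch size_mib within limit_s."""
    if limit_s <= 0:
        raise ValueError("limit_s must be positive")
    return size_mib / limit_s


# 492 MiB blob (measured locally); 10 min limit is ASSUMED from the observed abort
rate = required_rate_mib_s(492, 10 * 60)
print(f"{rate:.2f} MiB/s")
```

Any transfer slower than that would not finish within the assumed limit, regardless of how many layers are already cached.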

Acceptance criteria¶

AC1: Multiple circleci jobs succeed in creating the container-based environment

Suggestions¶

~~Maybe it can already help if we use --no-recommends within the package installs within Dockerfile to reduce the size of layers~~ -> nope, see #67855#note-1
Add a retry within the circleci jobs or on the level of how circleci jobs are triggered
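
The retry suggestion could look roughly like the following Python sketch. It is only an illustration: the helper name and its parameters are hypothetical, and it can only wrap commands that the job itself executes, not the image pull that circleCI performs before the job starts.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay_s=5.0, backoff=2.0):
    """Run cmd until it exits with status 0 or the attempts are exhausted.

    Returns the number of attempts used. Raises CalledProcessError
    if the last attempt fails as well.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt == attempts:
            raise subprocess.CalledProcessError(result.returncode, cmd)
        # Wait before the next try, increasing the delay each time
        time.sleep(delay_s)
        delay_s *= backoff


# Hypothetical usage inside a job step (not executed here):
# run_with_retry(["docker", "pull",
#                 "registry.opensuse.org/devel/openqa/ci/containers/base:latest"])
```

The growing delay is a design choice to avoid hammering a registry that is already slow.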

Related issues 2 (0 open — 2 closed)

Updated by okurz almost 5 years ago

Description updated (diff)

Updated by okurz almost 5 years ago

Description updated (diff)

I checked comparing "--no-recommends", "--recommends" and default using e.g.:

podman run --rm -it registry.opensuse.org/opensuse/tumbleweed zypper -n in --dry-run --recommends autoconf automake gcc-c++ libtool pkgconfig\(opencv\) pkg-config perl\(Module::CPANfile\) pkgconfig\(fftw3\) pkgconfig\(libpng\) pkgconfig\(sndfile\) pkgconfig\(theoraenc\) make rubygem\(sass\) python3-base python3-requests python3-future git-core rsync curl postgresql-devel postgresql-server qemu qemu-kvm qemu-tools tar xorg-x11-fonts sudo chromedriver

and found no difference between the default and "--no-recommends": both would install 1.6 GiB. With "--recommends" I would get 2.2 GiB, so that won't help.

Updated by okurz almost 5 years ago

Target version set to Ready

Updated by okurz almost 5 years ago

Status changed from New to Workable

Updated by okurz almost 5 years ago

Subject changed from circleci often abort in "cache" unable to read container image from registry.opensuse.org to [tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org

Updated by okurz over 4 years ago

Priority changed from Normal to Low
Target version changed from Ready to future

I have not observed this lately. We can bring it back to our backlog when we see it again more than once.

Updated by okurz over 4 years ago

Priority changed from Low to Normal
Target version changed from future to Ready

This happened again, failing our automatic dependency update jobs: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42253 and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42254 , with similar output:

Build-agent version 1.0.40192-8f9036ea (2020-09-29T09:45:01+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux d801a2f9b317 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

2d021088: Pulling fs layer 
c64518df: Downloading [=================>                                 ]  180.6MB/526.4MB
  Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

2d021088: Pulling fs layer 
c64518df: Downloading [==================>                                ]  195.4MB/526.4MB

context deadline exceeded

I can't find good information on how to handle "context deadline exceeded". I only know of https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded- regarding the "no output timeout".

I have found nothing about retrying within the "Spin Up Environment" step.

I see it as problematic that we download a 500MB+ image in every job and do not have much control over it. Maybe we can optimize the image, move some package installations to a later step, or not even use the image until a later step?
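
To quantify how far each attempt in the log above got before the deadline, the progress lines can be parsed. The following Python sketch is an illustration only: the regular expression is derived from the lines quoted in this ticket and the decimal unit factors are an assumption, not a documented output format.

```python
import re

# Matches e.g. "c64518df: Downloading [====>      ]  180.6MB/526.4MB"
PROGRESS_RE = re.compile(
    r"^(?P<layer>[0-9a-f]+): Downloading \[[=> ]*\]\s+"
    r"(?P<done>[\d.]+)(?P<done_unit>[kMG]?B)/"
    r"(?P<total>[\d.]+)(?P<total_unit>[kMG]?B)\s*$"
)
# ASSUMPTION: sizes are printed in decimal units
UNIT_FACTORS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9}


def download_fraction(line):
    """Return (layer_id, fraction_downloaded), or None for non-progress lines."""
    m = PROGRESS_RE.match(line.strip())
    if m is None:
        return None
    done = float(m["done"]) * UNIT_FACTORS[m["done_unit"]]
    total = float(m["total"]) * UNIT_FACTORS[m["total_unit"]]
    return m["layer"], done / total


first = "c64518df: Downloading [=================>                                 ]  180.6MB/526.4MB"
retry = "c64518df: Downloading [==================>                                ]  195.4MB/526.4MB"
for line in (first, retry):
    layer, fraction = download_fraction(line)
    print(f"{layer}: {fraction:.1%}")
```

For the two attempts quoted above this gives about 34% and 37% of the big layer, so neither attempt got even halfway.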

Updated by okurz over 4 years ago

Copied to action #72316: [tests][ci] circleci can fail in `zypper ref` due to temporary repository problems added

Updated by okurz over 4 years ago

Status changed from Workable to Feedback
Assignee set to okurz

We will try simply running at a different time: https://github.com/os-autoinst/openQA/pull/3451

#10

Updated by okurz over 4 years ago

Due date set to 2020-10-21

If the different time slot works well enough, we might be OK with just this change.

#11

Updated by okurz over 4 years ago

Status changed from Feedback to Resolved

So after many nights where the nightly pipeline failed the one from yesterday just passed fine at the new time: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4447/workflows/78a764f1-2b6b-4b91-b920-3e11bc57c1cc

I consider it suboptimal that the repeated image downloading of 500MiB takes 3-5 minutes but it seems we might have found a good approach for the problem at hand. Thanks tinita for the good idea of just running the pipeline at a different time of day.

#12

Updated by okurz over 4 years ago

Due date deleted (~~2020-10-21~~)
Status changed from Resolved to Workable
Assignee deleted (~~okurz~~)

https://app.circleci.com/pipelines/github/os-autoinst/openQA/4465/workflows/fc241df3-9822-4a09-90b0-a03ac4589dff

so much for that :D same output as in #67855#note-7

I researched again whether I could find anything about setting timeouts on circleCI, but again the only real thing I found is "no_output_timeout" for "run" commands, nothing for the "Spin Up Environment" part.

I am currently out of ideas except for ditching circleci :(

#13

Updated by okurz over 4 years ago

Target version changed from Ready to future

#14

Updated by okurz over 4 years ago

Status changed from Workable to In Progress
Assignee set to okurz
Priority changed from Normal to High
Target version changed from future to Ready

This now reproducibly hits users' PRs as well, where even a retry does not help, see https://github.com/os-autoinst/openQA/pull/3519#issuecomment-724252129

https://discuss.circleci.com/t/speed-up-spin-up-environment-with-custom-easticsearch-docker-image/34393/2 explains that we can try to have either more layers preserved in the container image or do more run-time installations and less up-front.

https://github.community/t/speed-comparison-github-actions-takes-3m-32s-circle-ci-takes-12s/17780 shows that circleCI can be quite fast, e.g. compared to github actions but this only works if using container images that can use the circleCI base images which are for example described in https://discuss.circleci.com/t/new-ubuntu-base-convenience-image-public-beta/33129 which are Ubuntu based. Does not help us much.

I am not sure if I have ever seen an image being already present on a host so we need to redownload every time. This is also what others report: https://discuss.circleci.com/t/spin-up-environment-docker-cache/11146/12

The combination of a heavy image and downloading from the slow registry.opensuse.org to circleCI workers has a significant impact here.

I am trying different approaches now:

https://github.com/os-autoinst/openQA/pull/3527 specifying "NoSquash" option for the base image

This looks promising so far. Instead of


Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 622ad35ed331 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

dbbc37c1: Pulling fs layer 
e03f6954: Downloading [======================>                            ]  239.2MB/542.9MB
  Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

dbbc37c1: Pulling fs layer 
e03f6954: Downloading [=======================>                           ]    255MB/542.9MB

context deadline exceeded

like in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4795/workflows/d766ccb1-ccf9-4c2e-a54b-2e6f7a110714/jobs/45709

we get

Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 87f84206b28e 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/home/okurz/openqa/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/home/okurz/openqa/containers/base:latest
latest: Pulling from home/okurz/openqa/containers/base

39a0c59f: Pulling fs layer 
9ab4a673: Pulling fs layer 
427d1320: Pulling fs layer 
7a9b7c44: Pulling fs layer 
bdf73f9f: Pulling fs layer 
de1ea851: Pulling fs layer 
30b208e0: Pulling fs layer 
3048358e: Pulling fs layer 
a803434d: Pulling fs layer 
0e2303e7: Pulling fs layer 
7a9b7c44: Downloading [>                                                  ]  2.113MB/254.2MB
7a9b7c44: Downloading [>                                                  ]    3.7MB/254.2MB
7a9b7c44: Downloading [>                                                  ]  4.227MB/254.2MB
7a9b7c44: Downloading [=>                                                 ]  5.284MB/254.2MB
bdf73f9f: Downloading [>                                                  ]  525.4kB/116.7MB
39a0c59f: Downloading [===>                                               ]  2.972MB/42.34MB
39a0c59f: Downloading [====>                                              ]  3.396MB/42.34MB
7a9b7c44: Downloading [=>                                                 ]  6.865MB/254.2MB
7a9b7c44: Downloading [=>                                                 ]  7.396MB/254.2MB
…
de1ea851: Downloading [=================================================> ]  184.3MB/185.3MB
de1ea851: Downloading [=================================================> ]  184.8MB/185.3MB
de1ea851: Download complete
7a9b7c44: Downloading [==========================================>        ]  214.4MB/254.2MB
7a9b7c44: Downloading [==========================================>        ]  214.9MB/254.2MB
…
7a9b7c44: Downloading [=================================================> ]    254MB/254.2MB
7a9b7c44: Extracting [==>                                                ]  15.04MB/254.2MB
7a9b7c44: Extracting [====>                                              ]   23.4MB/254.2MB

https://github.com/os-autoinst/openQA/pull/3528 using an image that just installs openQA-devel in one step -> image is 550MB so even 10MB bigger than default
https://github.com/os-autoinst/openQA/pull/3529 completely remove the base image, use a much smaller vanilla Leap image and install packages manually

The cache job looks fine in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45731 but "t" fails in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45735 due to missing tar in the base image.

Created https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3529_use_tiny_base/base/Dockerfile?expand=1 which uses leap but only installs tar.

EDIT: … and gzip, and then postgresql-server in ci-packages.txt. Now https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45912 fails as postgresql refuses to run as root, and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45913 fails to find the "shellcheck" command. I guess I can go back and keep the user.

Created https://github.com/os-autoinst/openQA/pull/3536 to just strip down the package installation within the base package without changing the other instructions in the container. Steps taken:

Called osc branch home:okurz:branches:devel:openQA:pr3528_use_devel_package_in_base_image base home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages
Adjusted https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/base/_service?expand=1 to point to the correct branch https://github.com/okurz/openQA/tree/fix/circleci_cache_base_image_less_packages
Set https://build.opensuse.org/projects/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/meta to publish enable
Looked up the published container on https://registry.opensuse.org/cgi-bin/cooverview?srch_term=project%3D%5Ehome%3Aokurz
Created https://github.com/os-autoinst/openQA/pull/3537 with a temporary test using the temporary image

#15

Updated by okurz over 4 years ago

Due date set to 2020-11-13
Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/3530 merged.

https://github.com/os-autoinst/openQA/pull/3536 is still waiting for an image to be published on registry.opensuse.org; publishing seems to be slow today. https://build.opensuse.org/package/show/devel:openQA:ci/base currently shows "failed" and a build log from 2020-11-09. Something is fishy but waiting solves many problems :)

#16

Updated by okurz over 4 years ago

Note by andriinikitin: "But indeed it should be faster than pulling it from registry. It looks it is CircleCI performance to blame for why it started failing. Now: pull stats: download 569.5MiB in 18m58.188s (512.4KiB/s) Back in June: pull stats: download 532.2MiB in 1m2.258s (8.548MiB/s). Or maybe it is SUSE provider got worse"
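
The two pull statistics quoted above can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names are made up for illustration, and the duration format is assumed to be the "18m58.188s" style shown in the quote.

```python
import re

# Durations like "18m58.188s" or "1m2.258s" (hours optional)
DURATION_RE = re.compile(
    r"^(?:(?P<h>\d+)h)?(?:(?P<m>\d+)m)?(?:(?P<s>[\d.]+)s)?$"
)


def parse_duration_s(text):
    """Convert a duration such as '18m58.188s' into seconds."""
    m = DURATION_RE.match(text)
    if not text or m is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(m["h"] or 0) * 3600 + int(m["m"] or 0) * 60 + float(m["s"] or 0)


def rate_kib_s(size_mib, duration):
    """Average transfer rate in KiB/s for size_mib downloaded in duration."""
    return size_mib * 1024 / parse_duration_s(duration)


now = rate_kib_s(569.5, "18m58.188s")
june = rate_kib_s(532.2, "1m2.258s")
print(f"now:  {now:.1f} KiB/s")
print(f"June: {june / 1024:.3f} MiB/s")
print(f"slowdown factor: {june / now:.1f}")
```

The computed rates match the quoted 512.4KiB/s and 8.548MiB/s, which corresponds to a slowdown by a factor of roughly 17.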

#17

Updated by okurz over 4 years ago

Status changed from Feedback to In Progress

Had support from andriinikitin and I am continuing with this. Feedback cycles are a bit long because I rely on circleCI in the loop and can't just test it locally. Well, I could use https://circleci.com/docs/2.0/local-cli/ :)

#18

Updated by okurz over 4 years ago

Due date changed from 2020-11-13 to 2020-11-20
Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/3536 looks ok now, waiting for review.

#19

Updated by okurz over 4 years ago

Due date changed from 2020-11-20 to 2020-11-24

PR merged. https://build.opensuse.org/package/view_file/devel:openQA:ci/base/_service:download_url:Dockerfile?expand=1 was automatically updated.

I will wait until this is built and at least one CI job passed.

#20

Updated by okurz over 4 years ago

After multiple problems I have partially reverted my change for now. I will check whether the updated base image works again in CI tests.

#21

Updated by okurz over 4 years ago

Due date changed from 2020-11-24 to 2020-12-15

andrii-suse provided a helpful comment in https://github.com/os-autoinst/openQA/pull/3572#issuecomment-731512892 on what we could try in a second attempt at the stripped-down CI image approach:

@okurz Ugh I see what went wrong with #3536 .
It built new cache on top of old ci:base image, so many packages were not included into cache, because they were not downloaded, because they were already installed.
Then ci:base was rebuilt according to Dockerfile changes, and next PR did pull updated ci:base image and tried to install the cache packages on top of it, which obviously failed.
I guess we can just introduce ci:base_v1 or ci:minimal for such purposes, instead of dealing with chicken/egg problem. :\ Then retry #3536

But for now I would like to monitor whether the circleCI jobs actually still fail at the same rate as before. The last Nbg datacenter power outage has caused quite some "changes" so maybe even the download from registry.opensuse.org behaves differently now :D
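
The chicken-and-egg problem described in the quoted comment can be illustrated with plain set arithmetic. In this Python sketch the package sets are hypothetical examples (the names are borrowed from this ticket, the actual package lists differ), and the model simply assumes that only packages not yet installed in the base image get downloaded and therefore cached.

```python
def build_cache(wanted, installed_in_base):
    """Packages that get downloaded, and therefore cached, on a given base image."""
    return wanted - installed_in_base


def missing_at_install(wanted, installed_in_base, cache):
    """Packages that are neither preinstalled nor available from the cache."""
    return wanted - installed_in_base - cache


# HYPOTHETICAL package sets for illustration only
wanted = {"tar", "gzip", "postgresql-server", "shellcheck"}
old_base = {"tar", "gzip", "postgresql-server"}  # base image before the change
new_base = {"tar", "gzip"}                       # stripped-down base image

# The cache was built on top of the old base image ...
cache = build_cache(wanted, old_base)
# ... but later used on top of the rebuilt, smaller base image
missing = missing_at_install(wanted, new_base, cache)
print("cached: ", sorted(cache))
print("missing:", sorted(missing))
```

Building the cache against a fixed minimal image, as proposed with ci:base_v1 or ci:minimal, would make `missing` empty in this model because cache and install would then see the same base.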

#22

Updated by okurz over 4 years ago

Due date changed from 2020-12-15 to 2021-01-27
Priority changed from High to Low

I have not seen timeouts in the past days, after the nbg power outage on 2020-11-18. Maybe the fresh boot of so many systems helped to resolve some issues :D

I will reduce the priority now and we can check back later if we want to revive the same approach.

#23

Updated by okurz over 4 years ago

Due date deleted (~~2021-01-27~~)
Status changed from Feedback to New
Assignee deleted (~~okurz~~)
Target version changed from Ready to future

not planning to continue unless we see the errors again.

#24

Updated by okurz over 1 year ago

Related to action #152941: circleCI job runs into 20m timeout due to slow download from registry.opensuse.org added
