[tests][ci] circleci jobs often abort in the "cache" step, unable to read the container image from registry.opensuse.org
I have often seen output similar to the following in past weeks, latest example https://app.circleci.com/pipelines/github/os-autoinst/openQA/3231/workflows/17a0a4f7-8fe9-47e3-a50d-191a87c75cc8/jobs/30830/steps:
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
81e5221d78cb: Pulling fs layer
3e86cb3bb3f5: Pulling fs layer
81e5221d78cb: Verifying Checksum
81e5221d78cb: Download complete
81e5221d78cb: Pull complete
error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest
Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest
So while it looks like the container layers are downloaded, after 10m there is still the error
error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest. Probably the base layer 81e5221d78cb is present in the cache but 3e86cb3bb3f5 fails to load. Locally I can see that this blob is 492 MiB, so maybe it is simply a bit too big.
- AC1: Multiple circleci jobs succeed in creating the container-based environment
- Maybe it can already help if we use --no-recommends within the package installs in the Dockerfile to reduce the size of the layers -> nope, see #67855#note-1
- Add a retry within the circleci jobs or on the level of how circleci jobs are triggered
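The retry idea could be sketched as a small wrapper around the pull command. This is only an illustration, not part of our current setup; the `retry` helper and the example pull invocation are hypothetical:

```shell
# Hypothetical retry helper: run a command up to N times, pausing
# RETRY_DELAY seconds (default 5) between attempts.
retry() {
  local attempts=$1; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    sleep "${RETRY_DELAY:-5}"
  done
  return 1
}

# Example use (illustrative only):
# retry 3 docker pull registry.opensuse.org/devel/openqa/ci/containers/base:latest
```

Note that such a wrapper only helps in steps we control, i.e. "run" commands; the "Spin Up Environment" step is not scriptable this way.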
#2 Updated by okurz about 1 year ago
- Description updated (diff)
I compared "--no-recommends", "--recommends" and the default using e.g.:
podman run --rm -it registry.opensuse.org/opensuse/tumbleweed zypper -n in --dry-run --recommends autoconf automake gcc-c++ libtool pkgconfig\(opencv\) pkg-config perl\(Module::CPANfile\) pkgconfig\(fftw3\) pkgconfig\(libpng\) pkgconfig\(sndfile\) pkgconfig\(theoraenc\) make rubygem\(sass\) python3-base python3-requests python3-future git-core rsync curl postgresql-devel postgresql-server qemu qemu-kvm qemu-tools tar xorg-x11-fonts sudo chromedriver
and found no difference between the default and "--no-recommends", which would install 1.6 GiB. With "--recommends" I would get 2.2 GiB, so that won't help.
- Priority changed from Low to Normal
- Target version changed from future to Ready
Happened again, failing our automatic dependency update jobs: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42253 and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42254 , with similar output:
Build-agent version 1.0.40192-8f9036ea (2020-09-29T09:45:01+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux d801a2f9b317 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
2d021088: Pulling fs layer
c64518df: Downloading [=================> ] 180.6MB/526.4MB
Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded
... retrying
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
2d021088: Pulling fs layer
c64518df: Downloading [==================> ] 195.4MB/526.4MB
context deadline exceeded
I could not find good information on how to handle "context deadline exceeded". I only know of https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded- regarding the "no output timeout".
I have found nothing regarding retrying within the "Spin Up Environment" step.
I see it as problematic that we download a 500MB+ image in every job and do not have much control over it. Maybe we can either optimize the image, move some package installations to a later step, or not even use the image until a later step?
- Status changed from Feedback to Resolved
So after many nights in which the nightly pipeline failed, yesterday's run passed fine at the new time: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4447/workflows/78a764f1-2b6b-4b91-b920-3e11bc57c1cc
I consider it suboptimal that the repeated download of the 500 MiB image takes 3-5 minutes, but it seems we might have found a good approach for the problem at hand. Thanks tinita for the good idea of just running the pipeline at a different time of day.
- Due date deleted
- Status changed from Resolved to Workable
- Assignee deleted
So much for that :D Same output as in #67855#note-7
I researched again whether I could find anything regarding setting timeouts on circleCI, but again the only real thing I found is "no_output_timeout" for "run" commands, not for the "Spin Up Environment" part.
I am currently out of ideas except for ditching circleci :(
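For reference, the only timeout knob I know of is set per "run" step in the config; a sketch (job and command names are illustrative, and this does not cover "Spin Up Environment"):

```yaml
# Illustrative fragment of .circleci/config.yml: "no_output_timeout"
# only exists on individual "run" steps.
jobs:
  test:
    steps:
      - run:
          command: make test
          no_output_timeout: 30m
```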
- Status changed from Workable to In Progress
- Assignee set to okurz
- Priority changed from Normal to High
- Target version changed from future to Ready
This now reproducibly hits user PRs as well, where even a retry does not help, see https://github.com/os-autoinst/openQA/pull/3519#issuecomment-724252129
https://discuss.circleci.com/t/speed-up-spin-up-environment-with-custom-easticsearch-docker-image/34393/2 explains that we can try to either preserve more layers in the container image or do more installations at run time and fewer up front.
https://github.community/t/speed-comparison-github-actions-takes-3m-32s-circle-ci-takes-12s/17780 shows that circleCI can be quite fast, e.g. compared to github actions, but only when using the circleCI convenience base images, described for example in https://discuss.circleci.com/t/new-ubuntu-base-convenience-image-public-beta/33129, which are Ubuntu based. That does not help us much.
I am not sure I have ever seen an image already present on a host, so we need to redownload it every time. This is also what others report: https://discuss.circleci.com/t/spin-up-environment-docker-cache/11146/12
The combination of a heavy image and downloads from the slow registry.opensuse.org to the circleCI workers has a significant impact here.
I am trying different approaches now:
- https://github.com/os-autoinst/openQA/pull/3527 specifying "NoSquash" option for the base image
This looks promising so far. Instead of

Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 622ad35ed331 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
dbbc37c1: Pulling fs layer
e03f6954: Downloading [======================> ] 239.2MB/542.9MB
Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded
... retrying
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
dbbc37c1: Pulling fs layer
e03f6954: Downloading [=======================> ] 255MB/542.9MB
context deadline exceeded

we now get:

Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 87f84206b28e 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/home/okurz/openqa/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/home/okurz/openqa/containers/base:latest
latest: Pulling from home/okurz/openqa/containers/base
39a0c59f: Pulling fs layer
9ab4a673: Pulling fs layer
427d1320: Pulling fs layer
7a9b7c44: Pulling fs layer
bdf73f9f: Pulling fs layer
de1ea851: Pulling fs layer
30b208e0: Pulling fs layer
3048358e: Pulling fs layer
a803434d: Pulling fs layer
0e2303e7: Pulling fs layer
7a9b7c44: Downloading [> ] 2.113MB/254.2MB
7a9b7c44: Downloading [> ] 3.7MB/254.2MB
7a9b7c44: Downloading [> ] 4.227MB/254.2MB
7a9b7c44: Downloading [=> ] 5.284MB/254.2MB
bdf73f9f: Downloading [> ] 525.4kB/116.7MB
39a0c59f: Downloading [===> ] 2.972MB/42.34MB
39a0c59f: Downloading [====> ] 3.396MB/42.34MB
7a9b7c44: Downloading [=> ] 6.865MB/254.2MB
7a9b7c44: Downloading [=> ] 7.396MB/254.2MB
…
de1ea851: Downloading [=================================================> ] 184.3MB/185.3MB
de1ea851: Downloading [=================================================> ] 184.8MB/185.3MB
de1ea851: Download complete
7a9b7c44: Downloading [==========================================> ] 213.8MB/254.2MB
7a9b7c44: Downloading [==========================================> ] 214.4MB/254.2MB
7a9b7c44: Downloading [==========================================> ] 214.9MB/254.2MB
…
7a9b7c44: Downloading [=================================================> ] 254MB/254.2MB
7a9b7c44: Extracting [==> ] 15.04MB/254.2MB
7a9b7c44: Extracting [====> ] 23.4MB/254.2MB
- https://github.com/os-autoinst/openQA/pull/3528 using an image that just installs openQA-devel in one step -> image is 550MB, so even 10MB bigger than the default
- https://github.com/os-autoinst/openQA/pull/3529 completely removing the base image, using a much smaller vanilla Leap image and manually installing packages
The cache job looks fine in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45731 but "t" fails in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45735 because tar is missing in the base image.
Created https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3529_use_tiny_base/base/Dockerfile?expand=1 which uses leap but only installs tar.
EDIT: … and gzip, and then postgresql-server in ci-packages.txt. Now https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45912 fails because postgresql refuses to run as root, and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45913 fails to find the "shellcheck" command. I guess I can go back and keep the user.
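A cheap way to catch this class of failure (missing tar, gzip, shellcheck, …) before CI does would be a small sanity check run inside a candidate base image; a sketch, where the tool list is an assumption covering only what has bitten us so far:

```shell
# Sketch: verify that tools the CI jobs rely on exist in the image.
required="tar gzip"
missing=""
for tool in $required; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "missing tools:$missing" >&2
  exit 1
fi
echo "all required tools present"
```

One could run this e.g. via podman against the candidate image before pushing it to the registry.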
Created https://github.com/os-autoinst/openQA/pull/3536 to just strip down the package installation within the base image without changing the other instructions in the container. Called
osc branch home:okurz:branches:devel:openQA:pr3528_use_devel_package_in_base_image base home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages
adjusted https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/base/_service?expand=1 to point to the correct branch https://github.com/okurz/openQA/tree/fix/circleci_cache_base_image_less_packages, set https://build.opensuse.org/projects/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/meta to publish enable, looked up the published container on https://registry.opensuse.org/cgi-bin/cooverview?srch_term=project%3D%5Ehome%3Aokurz , and created https://github.com/os-autoinst/openQA/pull/3537 with a temporary test using the temporary image
- Due date set to 2020-11-13
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/3536 is still waiting for an image to be published on registry.o.o; publishing seems slow today. https://build.opensuse.org/package/show/devel:openQA:ci/base currently shows "failed" and a build log from 2020-11-09. Something is fishy, but waiting solves many problems :)
Note by andriinikitin: "But indeed it should be faster than pulling it from registry. It looks it is CircleCI performance to blame for why it started failing. Now: pull stats: download 569.5MiB in 18m58.188s (512.4KiB/s) Back in June: pull stats: download 532.2MiB in 1m2.258s (8.548MiB/s). Or maybe it is SUSE provider got worse"
- Status changed from Feedback to In Progress
Had support from andriinikitin and I am continuing with this. Feedback cycles are a bit long because I rely on circleCI in the loop, so I can't just test everything locally. Well, I could use https://circleci.com/docs/2.0/local-cli/ :)
- Due date changed from 2020-11-20 to 2020-11-24
PR merged. https://build.opensuse.org/package/view_file/devel:openQA:ci/base/_service:download_url:Dockerfile?expand=1 was automatically updated.
I will wait until this is built and at least one CI job passed.
- Due date changed from 2020-11-24 to 2020-12-15
andrii-suse provided a helpful comment in https://github.com/os-autoinst/openQA/pull/3572#issuecomment-731512892
What we could try in a second attempt to bring in the approach of the stripped-down CI image:
okurz Ugh I see what went wrong with #3536 .
It built new cache on top of old ci:base image, so many packages were not included into cache, because they were not downloaded, because they were already installed.
Then ci:base was rebuilt according to Dockerfile changes, and next PR did pull updated ci:base image and tried to install the cache packages on top of it, which obviously failed.
I guess we can just introduce ci:base_v1 or ci:minimal for such purposes, instead of dealing with chicken/egg problem. :\ Then retry #3536
But for now I would like to monitor whether the circleCI jobs actually still fail at the same rate as before. The last Nbg datacenter power outage has caused quite some "changes", so maybe even the download from registry.opensuse.org behaves differently now :D
- Due date changed from 2020-12-15 to 2021-01-27
- Priority changed from High to Low
I have not seen timeouts in the past days after the Nbg power outage of 2020-11-18. Maybe the fresh boot of so many systems helped to resolve some issues :D
I will reduce the prio now and we can check back later whether we want to warm up the same approach again.