action #67855
[tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org
Description
Observation
I have often seen output similar to the following in the past weeks, latest example https://app.circleci.com/pipelines/github/os-autoinst/openQA/3231/workflows/17a0a4f7-8fe9-47e3-a50d-191a87c75cc8/jobs/30830/steps:
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
81e5221d78cb: Pulling fs layer
3e86cb3bb3f5: Pulling fs layer
81e5221d78cb: Verifying Checksum
81e5221d78cb: Download complete
81e5221d78cb: Pull complete
error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest
Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest
So while it looks like the layers of the container are downloaded, after 10m there is still the error "error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest". Probably the base layer 81e5221d78cb is present in the cache but 3e86cb3bb3f5 fails to load. Locally I can see that this blob has 492 MiB so maybe it's just a bit big.
Acceptance criteria
- AC1: Multiple circleci jobs succeed to create the container based environment
Suggestions
- Maybe it can already help if we use --no-recommends within the package installs within Dockerfile to reduce the size of layers -> nope, see #67855#note-1
- Add a retry within the circleci jobs or on the level of how circleci jobs are triggered
Updated by okurz over 4 years ago
- Description updated (diff)
I checked comparing "--no-recommends", "--recommends" and default using e.g.:
podman run --rm -it registry.opensuse.org/opensuse/tumbleweed zypper -n in --dry-run --recommends autoconf automake gcc-c++ libtool pkgconfig\(opencv\) pkg-config perl\(Module::CPANfile\) pkgconfig\(fftw3\) pkgconfig\(libpng\) pkgconfig\(sndfile\) pkgconfig\(theoraenc\) make rubygem\(sass\) python3-base python3-requests python3-future git-core rsync curl postgresql-devel postgresql-server qemu qemu-kvm qemu-tools tar xorg-x11-fonts sudo chromedriver
and found no difference between default and "--no-recommends"; both would install 1.6 GiB. With "--recommends" I would get 2.2 GiB, so that won't help.
Updated by okurz about 4 years ago
- Subject changed from circleci often abort in "cache" unable to read container image from registry.opensuse.org to [tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org
Updated by okurz about 4 years ago
- Priority changed from Normal to Low
- Target version changed from Ready to future
I have not observed this lately. We can bring it back to our backlog when we see it again more than once.
Updated by okurz about 4 years ago
- Priority changed from Low to Normal
- Target version changed from future to Ready
happened again failing our automatic dependency update jobs: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42253 and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42254 , similar output:
Build-agent version 1.0.40192-8f9036ea (2020-09-29T09:45:01+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux d801a2f9b317 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
2d021088: Pulling fs layer
c64518df: Downloading [=================> ] 180.6MB/526.4MB
Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
2d021088: Pulling fs layer
c64518df: Downloading [==================> ] 195.4MB/526.4MB
context deadline exceeded
I can't find good information on how to handle "context deadline exceeded". I only know of https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded- regarding the "no output timeout".
I have found nothing regarding retrying when within the "Spin Up Environment" step.
I see it as problematic that we download a 500MB+ image in every job and have not much control over it. Maybe we can either optimize the image, move some package installations to a later step or not even use the image until in a later step?
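If the image pull were moved out of "Spin Up Environment" into a later job step, as suggested above, it could be wrapped in a retry. The following is only a minimal sketch under that assumption: the helper names retry and pull_image, the attempt count, the delays and the 20-minute timeout are hypothetical, and it does not apply to the pull that circleCI itself performs while spinning up the environment.

```python
import subprocess
import time
from typing import Callable

IMAGE = "registry.opensuse.org/devel/openqa/ci/containers/base:latest"


def retry(action: Callable[[], bool], attempts: int = 3, delay: float = 5.0,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call action until it returns True, at most `attempts` times.

    The delay between attempts is doubled after every failure.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    return False


def pull_image(image: str = IMAGE, timeout: float = 1200.0) -> bool:
    """Return True if `podman pull` finishes successfully within `timeout` seconds."""
    try:
        result = subprocess.run(["podman", "pull", image], timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Treat a hanging pull or a missing podman binary as a failed attempt.
        return False
    return result.returncode == 0
```

A job step could then call retry(pull_image) and fail the job only if all attempts fail. Whether retrying helps at all depends on the registry being slow only temporarily.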
Updated by okurz almost 4 years ago
- Copied to action #72316: [tests][ci] circleci can fail in `zypper ref` due to temporary repository problems added
Updated by okurz almost 4 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
We will try with just a different time https://github.com/os-autoinst/openQA/pull/3451
Updated by okurz almost 4 years ago
- Due date set to 2020-10-21
If the different time slot works well enough we might be ok with just this change then.
Updated by okurz almost 4 years ago
- Status changed from Feedback to Resolved
So after many nights where the nightly pipeline failed the one from yesterday just passed fine at the new time: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4447/workflows/78a764f1-2b6b-4b91-b920-3e11bc57c1cc
I consider it suboptimal that the repeated image downloading of 500MiB takes 3-5 minutes but it seems we might have found a good approach for the problem at hand. Thanks tinita for the good idea of just running the pipeline at a different time of day.
Updated by okurz almost 4 years ago
- Due date deleted (2020-10-21)
- Status changed from Resolved to Workable
- Assignee deleted (okurz)
so much for that :D same output as in #67855#note-7
I researched again if I could find anything regarding setting timeouts on circleCI but again the only real thing I found is "no_output_timeout" for "run" commands but not for the "Spin Up Environment" part.
I am currently out of ideas except for ditching circleci :(
Updated by okurz almost 4 years ago
- Target version changed from Ready to future
Updated by okurz almost 4 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
- Priority changed from Normal to High
- Target version changed from future to Ready
this now hits PRs of users reproducibly as well where even retry does not help, see https://github.com/os-autoinst/openQA/pull/3519#issuecomment-724252129
https://discuss.circleci.com/t/speed-up-spin-up-environment-with-custom-easticsearch-docker-image/34393/2 explains that we can try to have either more layers preserved in the container image or do more run-time installations and less up-front.
https://github.community/t/speed-comparison-github-actions-takes-3m-32s-circle-ci-takes-12s/17780 shows that circleCI can be quite fast, e.g. compared to github actions but this only works if using container images that can use the circleCI base images which are for example described in https://discuss.circleci.com/t/new-ubuntu-base-convenience-image-public-beta/33129 which are Ubuntu based. Does not help us much.
I am not sure if I have ever seen an image being already present on a host so we need to redownload every time. This is also what others report: https://discuss.circleci.com/t/spin-up-environment-docker-cache/11146/12
The combination of a heavy image and downloading from the slow registry.opensuse.org to circleCI workers has a significant impact here.
I am trying different approaches now:
- https://github.com/os-autoinst/openQA/pull/3527 specifying "NoSquash" option for the base image
This looks promising so far. Instead of
Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 622ad35ed331 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
dbbc37c1: Pulling fs layer
e03f6954: Downloading [======================> ] 239.2MB/542.9MB
Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
dbbc37c1: Pulling fs layer
e03f6954: Downloading [=======================> ] 255MB/542.9MB
context deadline exceeded
we get
Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 87f84206b28e 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/home/okurz/openqa/containers/base:latest
image cache not found on this host, downloading registry.opensuse.org/home/okurz/openqa/containers/base:latest
latest: Pulling from home/okurz/openqa/containers/base
39a0c59f: Pulling fs layer
9ab4a673: Pulling fs layer
427d1320: Pulling fs layer
7a9b7c44: Pulling fs layer
bdf73f9f: Pulling fs layer
de1ea851: Pulling fs layer
30b208e0: Pulling fs layer
3048358e: Pulling fs layer
a803434d: Pulling fs layer
0e2303e7: Pulling fs layer
7a9b7c44: Downloading [> ] 2.113MB/254.2MB
7a9b7c44: Downloading [> ] 3.7MB/254.2MB
7a9b7c44: Downloading [> ] 4.227MB/254.2MB
7a9b7c44: Downloading [=> ] 5.284MB/254.2MB
bdf73f9f: Downloading [> ] 525.4kB/116.7MB
9a0c59f: Downloading [===> ] 2.972MB/42.34MB
9a0c59f: Downloading [====> ] 3.396MB/42.34MB
7a9b7c44: Downloading [=> ] 6.865MB/254.2MB
7a9b7c44: Downloading [=> ] 7.396MB/254.2MB
…
de1ea851: Downloading [=================================================> ] 184.3MB/185.3MB
de1ea851: Downloading [=================================================> ] 184.8MB/185.3MB
de1ea851: Download complete ====================================> ] 213.8MB/254.2MB
7a9b7c44: Downloading [==========================================> ] 214.4MB/254.2MB
7a9b7c44: Downloading [==========================================> ] 214.9MB/254.2MB
…
7a9b7c44: Downloading [=================================================> ] 254MB/254.2MB
7a9b7c44: Extracting [==> ] 15.04MB/254.2MB
7a9b7c44: Extracting [====> ] 23.4MB/254.2MB
- https://github.com/os-autoinst/openQA/pull/3528 using an image that just installs openQA-devel in one step -> image is 550MB so even 10MB bigger than default
- https://github.com/os-autoinst/openQA/pull/3529 completely remove the base image, use a much smaller leap vanilla image and go with manually installing packages
The cache job looks fine in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45731 but "t" fails in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45735 due to missing tar in the base image.
Created https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3529_use_tiny_base/base/Dockerfile?expand=1 which uses leap but only installs tar.
EDIT: … and gzip, and then postgresql-server in ci-packages.txt. Now https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45912 fails as postgresql refuses to run as root, and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45913 fails to find the "shellcheck" command even though. I guess I can go back and keep the user.
Created https://github.com/os-autoinst/openQA/pull/3536 to just strip down the package installation within the base package without changing the other instructions in the container. Steps:
- called osc branch home:okurz:branches:devel:openQA:pr3528_use_devel_package_in_base_image base home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages
- adjusted https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/base/_service?expand=1 to point to the correct branch https://github.com/okurz/openQA/tree/fix/circleci_cache_base_image_less_packages
- set https://build.opensuse.org/projects/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/meta to publish enable
- looked up the published container on https://registry.opensuse.org/cgi-bin/cooverview?srch_term=project%3D%5Ehome%3Aokurz
- created https://github.com/os-autoinst/openQA/pull/3537 with a temporary test using the temporary image
Updated by okurz almost 4 years ago
- Due date set to 2020-11-13
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/3530 merged.
https://github.com/os-autoinst/openQA/pull/3536 is still waiting for an image to be published on registry.o.o , seems like that is slow to publish today. https://build.opensuse.org/package/show/devel:openQA:ci/base shows currently "failed" and a build log from 2020-11-09. Something is fishy but waiting solves many problems :)
Updated by okurz almost 4 years ago
Note by andriinikitin: "But indeed it should be faster than pulling it from registry. It looks it is CircleCI performance to blame for why it started failing. Now: pull stats: download 569.5MiB in 18m58.188s (512.4KiB/s) Back in June: pull stats: download 532.2MiB in 1m2.258s (8.548MiB/s). Or maybe it is SUSE provider got worse"
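The two pull stats quoted above can be cross-checked with a short calculation. This is just an arithmetic sketch: the sizes and durations are taken from the note, while the helper names duration_seconds and rate_mib_per_s are made up for illustration.

```python
import re


def duration_seconds(text: str) -> float:
    """Convert a duration such as '18m58.188s' or '1m2.258s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def rate_mib_per_s(size_mib: float, duration: str) -> float:
    """Average download rate in MiB/s for `size_mib` pulled in `duration`."""
    return size_mib / duration_seconds(duration)


now = rate_mib_per_s(569.5, "18m58.188s")
june = rate_mib_per_s(532.2, "1m2.258s")

print(f"now:      {now * 1024:.1f} KiB/s")
print(f"June:     {june:.3f} MiB/s")
print(f"slowdown: {june / now:.1f}x")
```

The computed rates match the quoted 512.4KiB/s and 8.548MiB/s, which means the pull became roughly 17 times slower while the image itself grew by only about 7%.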
Updated by okurz almost 4 years ago
- Status changed from Feedback to In Progress
Got support from andriinikitin and I am continuing with this. Feedback cycles are a bit long because I am relying on circleCI in the loop, so I can't just test it locally. Well, I could use https://circleci.com/docs/2.0/local-cli/ :)
Updated by okurz almost 4 years ago
- Due date changed from 2020-11-13 to 2020-11-20
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/3536 looks ok now, waiting for review.
Updated by okurz almost 4 years ago
- Due date changed from 2020-11-20 to 2020-11-24
PR merged. https://build.opensuse.org/package/view_file/devel:openQA:ci/base/_service:download_url:Dockerfile?expand=1 was automatically updated.
I will wait until this is built and at least one CI job passed.
Updated by okurz almost 4 years ago
After multiple problems I have for now partially reverted my change. Will check if the again updated base image is ok again in CI tests.
Updated by okurz almost 4 years ago
- Due date changed from 2020-11-24 to 2020-12-15
andrii-suse provided a helpful comment in https://github.com/os-autoinst/openQA/pull/3572#issuecomment-731512892
on what we could try in a second attempt to bring in the approach of the stripped-down CI image:
@okurz Ugh I see what went wrong with #3536 .
It built new cache on top of old ci:base image, so many packages were not included into cache, because they were not downloaded, because they were already installed.
Then ci:base was rebuilt according to Dockerfile changes, and next PR did pull updated ci:base image and tried to install the cache packages on top of it, which obviously failed.
I guess we can just introduce ci:base_v1 or ci:minimal for such purposes, instead of dealing with chicken/egg problem. :\ Then retry #3536
But for now I would like to monitor whether the circleCI jobs actually still fail at the same rate as before. The last Nbg datacenter power outage has caused quite some "changes" so maybe even the download from registry.opensuse.org behaves differently now :D
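The chicken/egg problem described in the quoted comment can be illustrated with plain set arithmetic. This is a toy model, not how the package cache is actually built: the function name and the assignment of packages to the old and new base images are hypothetical, chosen only to show the effect.

```python
def packages_to_download(wanted: set[str], installed: set[str]) -> set[str]:
    """A package manager only downloads what is not already installed."""
    return wanted - installed


# Hypothetical package sets, for illustration only.
wanted = {"tar", "gzip", "postgresql-server", "qemu"}
old_base = {"tar", "gzip", "postgresql-server"}  # old, bigger ci:base image
new_base = {"tar", "gzip"}                       # stripped-down ci:base image

# The cache is filled while the old base image is still the published one,
# so packages already present in that image never end up in the cache.
cache = packages_to_download(wanted, old_base)

# Later jobs pull the rebuilt base image and install only from the cache.
missing = packages_to_download(wanted, new_base) - cache

print("cached: ", sorted(cache))
print("missing:", sorted(missing))
```

In this model every package that was removed from the base image but was still installed when the cache was built ends up missing. A separately named image such as the suggested ci:base_v1 or ci:minimal avoids this because the cache is built against the same image that later jobs pull.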
Updated by okurz almost 4 years ago
- Due date changed from 2020-12-15 to 2021-01-27
- Priority changed from High to Low
I have not seen timeouts in past days after the nbg power outage 2020-11-18. Maybe the fresh boot of so many systems helped to resolve some issues :D
I will reduce prio now and we can check back later if we want to warm up the same approach again.
Updated by okurz almost 4 years ago
- Due date deleted (2021-01-27)
- Status changed from Feedback to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
not planning to continue unless we see the errors again.
Updated by okurz 9 months ago
- Related to action #152941: circleCI job runs into 20m timeout due to slow download from registry.opensuse.org added