
action #67855

[tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org

Added by okurz 10 months ago. Updated 5 months ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Concrete Bugs
Target version:
Start date:
2020-06-09
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

I have often seen output similar to the following in recent weeks; the latest example is https://app.circleci.com/pipelines/github/os-autoinst/openQA/3231/workflows/17a0a4f7-8fe9-47e3-a50d-191a87c75cc8/jobs/30830/steps:

Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base
81e5221d78cb: Pulling fs layer
3e86cb3bb3f5: Pulling fs layer
81e5221d78cb: Verifying Checksum
81e5221d78cb: Download complete
81e5221d78cb: Pull complete
  error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest

Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest

So while it looks like the layers of the container image are downloaded, after 10m there is still the error `error starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest: Error: No such image: registry.opensuse.org/devel/openqa/ci/containers/base:latest`. Probably the base layer 81e5221d78cb is present in the cache but 3e86cb3bb3f5 fails to load. Locally I can see that this blob has 492 MiB, so maybe it is just a bit big.

Acceptance criteria

  • AC1: Multiple circleci jobs succeed to create the container based environment

Suggestions

  • Maybe it can already help if we use --no-recommends for the package installations within the Dockerfile to reduce the size of the layers -> nope, see #67855#note-1
  • Add a retry within the circleci jobs or on the level of how circleci jobs are triggered
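The retry idea from the last bullet could look like the following sketch; the `retry` helper and its parameters are purely illustrative, not something circleci provides, and it could only cover our own run steps, not the "Spin Up Environment" phase:

```shell
#!/bin/sh
# Illustrative retry wrapper (not a circleci feature): re-run a command up
# to a given number of times before giving up.
retry() {
    tries=$1; shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$tries" ]; then
            echo "retry: giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep 1
    done
    return 0
}

# Demo: a command that only succeeds on its third invocation.
attempts_file=$(mktemp)
flaky() {
    count=$(cat "$attempts_file")
    count=$((${count:-0} + 1))
    echo "$count" > "$attempts_file"
    [ "$count" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $(cat "$attempts_file") attempts"
# prints "succeeded after 3 attempts"
```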

Related issues

Copied to openQA Project - action #72316: [tests][ci] circleci can fail in `zypper ref` due to temporary repository problems (Resolved, 2020-06-09)

History

#1 Updated by okurz 10 months ago

  • Description updated (diff)

#2 Updated by okurz 10 months ago

  • Description updated (diff)

I checked comparing "--no-recommends", "--recommends" and default using e.g.:

podman run --rm -it registry.opensuse.org/opensuse/tumbleweed zypper -n in --dry-run --recommends autoconf automake gcc-c++ libtool pkgconfig\(opencv\) pkg-config perl\(Module::CPANfile\) pkgconfig\(fftw3\) pkgconfig\(libpng\) pkgconfig\(sndfile\) pkgconfig\(theoraenc\) make rubygem\(sass\) python3-base python3-requests python3-future git-core rsync curl postgresql-devel postgresql-server qemu qemu-kvm qemu-tools tar xorg-x11-fonts sudo chromedriver

and found no difference between the default and "--no-recommends", which would install 1.6 GiB. With "--recommends" I would get 2.2 GiB, so that won't help.
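The size numbers above come from zypper's dry-run summary; a small sketch of pulling such a figure out of the summary text (the sample line is canned here, and zypper's exact wording may differ between versions):

```shell
# Sketch: extract the projected install size from a zypper dry-run summary
# line. The sample text is illustrative; in practice it would come from
# "zypper -n in --dry-run ..." output, whose wording may vary.
summary='After the operation, additional 1.6 GiB will be used.'
size=$(printf '%s\n' "$summary" | sed -n 's/.*additional \([0-9.]* [KMG]iB\).*/\1/p')
echo "projected install size: $size"
# prints "projected install size: 1.6 GiB"
```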

#3 Updated by okurz 9 months ago

  • Target version set to Ready

#4 Updated by okurz 9 months ago

  • Status changed from New to Workable

#5 Updated by okurz 9 months ago

  • Subject changed from circleci often abort in "cache" unable to read container image from registry.opensuse.org to [tests][ci] circleci often abort in "cache" unable to read container image from registry.opensuse.org

#6 Updated by okurz 7 months ago

  • Priority changed from Normal to Low
  • Target version changed from Ready to future

I have not observed this lately. We can bring it back to our backlog when we see it again more than once.

#7 Updated by okurz 7 months ago

  • Priority changed from Low to Normal
  • Target version changed from future to Ready

Happened again, failing our automatic dependency update jobs: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42253 and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4392/workflows/74be66a8-9770-43bb-9916-7a5e769f7cd9/jobs/42254 , with similar output:

Build-agent version 1.0.40192-8f9036ea (2020-09-29T09:45:01+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux d801a2f9b317 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

2d021088: Pulling fs layer 
c64518df: Downloading [=================>                                 ]  180.6MB/526.4MB
  Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

2d021088: Pulling fs layer 
c64518df: Downloading [==================>                                ]  195.4MB/526.4MB

context deadline exceeded

I can't find good information on how to handle "context deadline exceeded". I only know of https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded- regarding the "no output timeout".

I have found nothing regarding retrying within the "Spin Up Environment" step.

I see it as problematic that we download a 500 MB+ image in every job and have little control over it. Maybe we can optimize the image, move some package installations to a later step, or not use the image until a later step?
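The "move some package installations to a later step" idea could be sketched like this; the file name mirrors the ci-packages.txt mentioned later in this ticket, but the step itself and its package list are hypothetical:

```shell
# Hypothetical run step: instead of baking all packages into the base image,
# install the ones listed in ci-packages.txt at job run time. For this sketch
# we only assemble and print the install command rather than executing zypper.
printf '%s\n' tar gzip postgresql-server > ci-packages.txt
cmd="zypper -n install $(tr '\n' ' ' < ci-packages.txt)"
echo "$cmd"
```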

#8 Updated by okurz 6 months ago

  • Copied to action #72316: [tests][ci] circleci can fail in `zypper ref` due to temporary repository problems added

#9 Updated by okurz 6 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

We will try just a different time slot: https://github.com/os-autoinst/openQA/pull/3451

#10 Updated by okurz 6 months ago

  • Due date set to 2020-10-21

If the different time slot works well enough, we might be OK with just this change.

#11 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

After many nights in which the nightly pipeline failed, yesterday's run passed fine at the new time: https://app.circleci.com/pipelines/github/os-autoinst/openQA/4447/workflows/78a764f1-2b6b-4b91-b920-3e11bc57c1cc

I consider it suboptimal that repeatedly downloading the 500 MiB image takes 3-5 minutes, but it seems we might have found a good approach for the problem at hand. Thanks tinita for the good idea of just running the pipeline at a different time of day.

#12 Updated by okurz 6 months ago

  • Due date deleted (2020-10-21)
  • Status changed from Resolved to Workable
  • Assignee deleted (okurz)

https://app.circleci.com/pipelines/github/os-autoinst/openQA/4465/workflows/fc241df3-9822-4a09-90b0-a03ac4589dff

So much for that :D Same output as in #67855#note-7.

I researched again whether I could find anything regarding setting timeouts on circleCI, but again the only real thing I found is "no_output_timeout" for "run" commands, not for the "Spin Up Environment" part.
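For comparison, this is roughly what the timeout knob that does exist looks like; it only applies to run steps, and the 30m value is illustrative:

```yaml
# Illustrative circleci config fragment: "no_output_timeout" can be raised
# for individual "run" steps, but there is no equivalent knob for the
# "Spin Up Environment" phase.
steps:
  - run:
      name: Run tests
      command: make test
      no_output_timeout: 30m
```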

I am currently out of ideas except for ditching circleci :(

#13 Updated by okurz 6 months ago

  • Target version changed from Ready to future

#14 Updated by okurz 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version changed from future to Ready

This now reproducibly hits users' PRs as well, where even a retry does not help, see https://github.com/os-autoinst/openQA/pull/3519#issuecomment-724252129

https://discuss.circleci.com/t/speed-up-spin-up-environment-with-custom-easticsearch-docker-image/34393/2 explains that we can try to have either more layers preserved in the container image or do more run-time installations and less up-front.

https://github.community/t/speed-comparison-github-actions-takes-3m-32s-circle-ci-takes-12s/17780 shows that circleCI can be quite fast, e.g. compared to GitHub Actions, but only when using container images based on the circleCI convenience base images, which are described for example in https://discuss.circleci.com/t/new-ubuntu-base-convenience-image-public-beta/33129 and are Ubuntu based. That does not help us much.

I am not sure I have ever seen an image already present on a host, so we need to re-download it every time. This is also what others report: https://discuss.circleci.com/t/spin-up-environment-docker-cache/11146/12

The combination of a heavy image and downloads from the slow registry.opensuse.org to the circleCI workers has a significant impact here.

I am trying different approaches now:

This looks promising so far. Instead of

Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 622ad35ed331 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/devel/openqa/ci/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

dbbc37c1: Pulling fs layer 
e03f6954: Downloading [======================>                            ]  239.2MB/542.9MB
  Error pulling image registry.opensuse.org/devel/openqa/ci/containers/base:latest: context deadline exceeded... retrying
  image cache not found on this host, downloading registry.opensuse.org/devel/openqa/ci/containers/base:latest
latest: Pulling from devel/openqa/ci/containers/base

dbbc37c1: Pulling fs layer 
e03f6954: Downloading [=======================>                           ]    255MB/542.9MB

context deadline exceeded

like in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4795/workflows/d766ccb1-ccf9-4c2e-a54b-2e6f7a110714/jobs/45709

we get

Build-agent version 1.0.44333-fe158151 (2020-11-09T19:49:56+0000)
Docker Engine Version: 18.09.6
Kernel Version: Linux 87f84206b28e 4.15.0-1077-aws #81-Ubuntu SMP Wed Jun 24 16:48:15 UTC 2020 x86_64 Linux
Starting container registry.opensuse.org/home/okurz/openqa/containers/base:latest
  image cache not found on this host, downloading registry.opensuse.org/home/okurz/openqa/containers/base:latest
latest: Pulling from home/okurz/openqa/containers/base

39a0c59f: Pulling fs layer 
9ab4a673: Pulling fs layer 
427d1320: Pulling fs layer 
7a9b7c44: Pulling fs layer 
bdf73f9f: Pulling fs layer 
de1ea851: Pulling fs layer 
30b208e0: Pulling fs layer 
3048358e: Pulling fs layer 
a803434d: Pulling fs layer 
0e2303e7: Pulling fs layer 
7a9b7c44: Downloading [>                                                  ]  2.113MB/254.2MB
7a9b7c44: Downloading [>                                                  ]    3.7MB/254.2MB
7a9b7c44: Downloading [>                                                  ]  4.227MB/254.2MB
7a9b7c44: Downloading [=>                                                 ]  5.284MB/254.2MB
bdf73f9f: Downloading [>                                                  ]  525.4kB/116.7MB
9a0c59f: Downloading [===>                                               ]  2.972MB/42.34MB
9a0c59f: Downloading [====>                                              ]  3.396MB/42.34MB
7a9b7c44: Downloading [=>                                                 ]  6.865MB/254.2MB
7a9b7c44: Downloading [=>                                                 ]  7.396MB/254.2MB
…
de1ea851: Downloading [=================================================> ]  184.3MB/185.3MB
de1ea851: Downloading [=================================================> ]  184.8MB/185.3MB
de1ea851: Download complete  ====================================>        ]  213.8MB/254.2MB
7a9b7c44: Downloading [==========================================>        ]  214.4MB/254.2MB
7a9b7c44: Downloading [==========================================>        ]  214.9MB/254.2MB
…
7a9b7c44: Downloading [=================================================> ]    254MB/254.2MB
7a9b7c44: Extracting [==>                                                ]  15.04MB/254.2MB
7a9b7c44: Extracting [====>                                              ]   23.4MB/254.2MB

The cache job looks fine in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45731 but "t" fails in https://app.circleci.com/pipelines/github/os-autoinst/openQA/4807/workflows/6c702455-a0fc-4196-a51a-dc3d50aa977f/jobs/45735 due to missing tar in the base image.

Created https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3529_use_tiny_base/base/Dockerfile?expand=1 which uses leap but only installs tar.

EDIT: … and gzip, and then postgresql-server in ci-packages.txt . Now https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45912 fails as postgresql refuses to run as root, and https://app.circleci.com/pipelines/github/os-autoinst/openQA/4824/workflows/15630d74-7d73-46b4-8ef2-f05e4f7665cc/jobs/45913 fails to find the "shellcheck" command even though it should be available. I guess I can go back and keep the user.

Created https://github.com/os-autoinst/openQA/pull/3536 to just strip down the package installation within the base package without changing the other instructions in the container. Called `osc branch home:okurz:branches:devel:openQA:pr3528_use_devel_package_in_base_image base home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages`, adjusted https://build.opensuse.org/package/view_file/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/base/_service?expand=1 to point to the correct branch https://github.com/okurz/openQA/tree/fix/circleci_cache_base_image_less_packages, set https://build.opensuse.org/projects/home:okurz:branches:devel:openQA:pr3536_circleci_cache_base_image_less_packages/meta to publish-enabled, looked up the published container on https://registry.opensuse.org/cgi-bin/cooverview?srch_term=project%3D%5Ehome%3Aokurz , and created https://github.com/os-autoinst/openQA/pull/3537 with a temporary test using the temporary image.

#15 Updated by okurz 5 months ago

  • Due date set to 2020-11-13
  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/3530 merged.

https://github.com/os-autoinst/openQA/pull/3536 is still waiting for an image to be published on registry.o.o ; publishing seems to be slow today. https://build.opensuse.org/package/show/devel:openQA:ci/base currently shows "failed" and a build log from 2020-11-09. Something is fishy, but waiting solves many problems :)

#16 Updated by okurz 5 months ago

Note by andriinikitin: "But indeed it should be faster than pulling it from registry. It looks it is CircleCI performance to blame for why it started failing. Now: pull stats: download 569.5MiB in 18m58.188s (512.4KiB/s) Back in June: pull stats: download 532.2MiB in 1m2.258s (8.548MiB/s). Or maybe it is SUSE provider got worse"
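The quoted rates are consistent with the quoted sizes and durations, as a quick check confirms:

```shell
# Cross-check the quoted pull rates: size divided by duration.
# Now: 569.5 MiB in 18m58.188s; back in June: 532.2 MiB in 1m2.258s.
awk 'BEGIN {
    nov = 569.5 * 1024 / (18 * 60 + 58.188)  # KiB/s
    jun = 532.2 / (1 * 60 + 2.258)           # MiB/s
    printf "now: %.1f KiB/s, June: %.3f MiB/s\n", nov, jun
}'
# prints "now: 512.4 KiB/s, June: 8.548 MiB/s"
```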

#17 Updated by okurz 5 months ago

  • Status changed from Feedback to In Progress

Had support from andriinikitin and I am continuing with this. Feedback cycles are a bit long because I rely on circleCI in the loop, so I can't just test it locally. Well, I could use https://circleci.com/docs/2.0/local-cli/ :)

#18 Updated by okurz 5 months ago

  • Due date changed from 2020-11-13 to 2020-11-20
  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/3536 looks ok now, waiting for review.

#19 Updated by okurz 5 months ago

  • Due date changed from 2020-11-20 to 2020-11-24

PR merged. https://build.opensuse.org/package/view_file/devel:openQA:ci/base/_service:download_url:Dockerfile?expand=1 was automatically updated.

I will wait until this is built and at least one CI job passed.

#20 Updated by okurz 5 months ago

After multiple problems I have for now partially reverted my change. Will check whether the updated base image is OK again in the CI tests.

#21 Updated by okurz 5 months ago

  • Due date changed from 2020-11-24 to 2020-12-15

andrii-suse provided a helpful comment in https://github.com/os-autoinst/openQA/pull/3572#issuecomment-731512892 on what we could try in a second attempt to bring in the approach of the stripped-down CI image:

okurz Ugh I see what went wrong with #3536 .
It built new cache on top of old ci:base image, so many packages were not included into cache, because they were not downloaded, because they were already installed.
Then ci:base was rebuilt according to Dockerfile changes, and next PR did pull updated ci:base image and tried to install the cache packages on top of it, which obviously failed.
I guess we can just introduce ci:base_v1 or ci:minimal for such purposes, instead of dealing with chicken/egg problem. :\ Then retry #3536

But for now I would like to monitor whether the circleCI jobs actually still fail at the same rate as before. The last Nbg datacenter power outage has caused quite some "changes", so maybe even the download from registry.opensuse.org behaves differently now :D

#22 Updated by okurz 5 months ago

  • Due date changed from 2020-12-15 to 2021-01-27
  • Priority changed from High to Low

I have not seen timeouts in the past days after the Nbg power outage of 2020-11-18. Maybe the fresh boot of so many systems helped to resolve some issues :D

I will reduce prio now and we can check back later if we want to warm up the same approach again.

#23 Updated by okurz 5 months ago

  • Due date deleted (2021-01-27)
  • Status changed from Feedback to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future

Not planning to continue unless we see the errors again.
