Project

General

Profile

Actions

action #43823

closed

o3 workers immediately incompleting all jobs, caching service can not be reached

Added by okurz about 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
2018-11-15
Due date:
% Done:

0%

Estimated time:

Description

Observation

[15/11/2018 08:37:44] <fvogt> openQA is down: "result: setup failure: Cache service not available.". Just a usual Thursday?
[15/11/2018 08:38:15] <DimStar> already 4 hours ago - so a bit too early for a 'usual' thursday
…
[15/11/2018 09:14:22] <DimStar> okurz: who do I kick to fix o3/caching? :)
…
[15/11/2018 09:23:03] <okurz> what is the systemd service openqa-worker@multi-user?
[15/11/2018 09:28:14] <okurz> DimStar, sysrich: I suspect tonights upgrade of the o3 openQA workers broke it. I will try to find out the diff in the versions and see if I can revert something

Further investigation

From openqaworker1:/var/log/zypper.log:

2018-11-15 00:55:26 <1> openqaworker1(5900) [zypper++] Summary.cc(readPool):135 <uninstall> I_TsU(43455)openQA-worker-4.6.1542112053.3a061720-lp150.870.1.noar
ch(@System)
2018-11-15 00:55:41 <1> openqaworker1(5900) [zypp] PackageProvider.cc(providePackage):412 provide Package (38)openQA-worker-4.6.1542208789.64bcff24-lp150.880.1.noarch(devel_openQA)

-> package openQA-worker got updated from 4.6.1542112053.3a061720-lp150.870.1 to 4.6.1542208789.64bcff24-lp150.880.1 which has the following git commits included:

64bcff24 Add 2 new dependencies to the travis container: (#1864)
3a07cfa6 docs: Extend definition of *softfailed* to "known issues" (#1837)
7d4d9727 Refuse updating job group name with empty or blank (#1844)
fcf42650 Merge pull request #1852 from Martchus/show_only_latest_on_overview
bb46c55c Apparmor: Add /dev/hugepages (#1857)
7c800f79 Fix broken in section of inter machine dependencies (#1859)
9da05482 Merge pull request #1783 from mudler/separate_minion_cache_service
772ee569 Wrap up and polish
4e04c40e Do not use wantarray in GRU's enqueue()
b6354356 Update POD OpenQA::Worker::Cache::Client's SYNOPSIS
6d0f1b56 Avoid to race to get available minion workers
968ed4c9 Drop PID from log
8c7989c4 Move Minion workers under a single service
20cd5a5e Cleanup
5120b9df Fix typo
ea55d577 Adapt gru tests
84c508f2 Generate same lock for asset/host in the request object
b95c4105 Fix rsync output in minion task
b10f6265 Refactor and update Documentation
8f94fbcf Use the cache service to perform rsync tests caching
eb302f01 Create Minion task for rsync execution
aed9eda6 Use the host folder to cache tests/needles
4709b255 Do not check if child is running if it's not there
817b8669 Reduce the lock expiration time
262fe2cf Trim spaces from left and right in assets name
f2cefd87 Hold the db handle during transactions
73b059b5 Add test for Minion task registration and execution
c6bb9043 specfile: DBD-SQLite needs to be > 1.51, required by Minion::Backend::SQLite
b6c814c6 Add tests and cover also subprocess execution
a98f78fb Add function to generate session token
b559f9eb Read and set limit from config file, defaults to 50GB
88a224b1 Makefile: install Worker Cache Service systemd unit
f5268886 Add POD for Cache Service and extend docs/Installing.asciidoc
353c0bf5 Add service file and adapt openQA.spec
59e725c8 Avoid useless quotes on hash key
0857b29d Debug - adapt compile-check-all
976830b3 Address to PR comments - also add test for checking output in cache-service.t
52eee59c Drop unused function
910991c2 Write cache output into os-autoinst log
f86e165f Make possible to retrieve cache output
a0eb5c56 Add debug messages while enqueuing download
0b88e3f6 Asset path needs to be a string for vars
ff4f68af Make dequeue and status post
4eeef27f Adapt fullstack to cache path change
2c28bcf0 Extract domain parsing to _base_url and use it also in the client to resolve asset path
6104562c Adapt tests
09cb313c Get the cache path right
806560ec Incomplete job if cache service is down
3e62a820 tests: Wait also for worker minion in fullstack
3248c99c Use defaults in isotovideo engine
e6d6a321 Update defaults and remove protoyped function from Cache::Client
3ea979ae tests: Add test for available_workers()
aa0926f1 Add available_workers() to Cache::Client
ce3a5011 Fullstack: move sqlite cache check up
a280d126 Cache: make retry() return always the response object
4d2d04d4 Fullstack: Wait cache to be ready
f4d6e2dd Adapt tests and utils
378f61ea Drop unneeded default
69978a7f Use OpenQA::Worker::Client inside the cache task
dfaeca5b Start to adapt fullstack test
85a4a839 Adapt to PBP
7a52f2a1 Add dep on Minion::Backend::SQLite
ce0a9905 Move common testing functions to Test::Utils as they will be reused also in fullstack test
9393d5e0 Add Mojo::SQLite dependency
f1a1c59e Use the Cache service within the worker
b48c9196 Add accessors to retrieve asset path and check if exists
11605149 Remove unneded log line and cleanup check_limits()
85bf029d Finishing porting the Cache.pm code to Mojo::SQLite
b7f63eae Rework cleanup strategy, avoid endless loops - update etag fails now
79adbdd2 Use Minion Admin plugin with the cache service
bd72dca4 Drop downloading, Cache.pm now records just etag and size
84f12547 Use Devel::Cover in cache-service tests to enhance coverage
54ce7e5c Adapt cache tests - still fails on size calculation
0662f43c Move Cache.pm code to Mojo::SQLite
8d01039d Add dequeue() and tests for it
d9caf916 Fix and test enqueued tasks
2939bb24 Do not use list_jobs from Minion to gather statuses, causes races
fcd74a0d Add default usage tests and use minion lock to detect if it's in use
eabb02cf Fix test copyright header
14551a8e Add standard client usage
204375b5 Add Cache service client
bbceb9d1 Do not lock anymore on assets
688ca9f0 Add test for cache service
5acaea86 Add script to start cache service
ebb1307a Create separate class for Minion Server and worker
1bf1a1bc (Martchus/show_only_latest_on_overview) Prevent filter on test overview to reveal old jobs
10136589 Test that filtering overview does not reveal old jobs

I suspect https://github.com/os-autoinst/openQA/pull/1783 as it is related to "cache".

On openqaworker1 I started the systemd services openqa-worker-cacheservice-minion and openqa-worker-cacheservice manually and triggered a job https://openqa.opensuse.org/tests/796621#live which seems to run for now. I wonder if the services should be automatically started by package upgrades? But I guess not really because caching so far was just a "config option".

However I have the following observations:

# journalctl -f
-- Logs begin at Thu 2018-11-15 03:32:45 CET. --
Nov 15 09:34:33 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [DEBUG] CACHE: Health: Real size: 52787683840, Configured limit: 53687091200
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52787683840
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [DEBUG] CACHE: Health: Real size: 50542682624, Configured limit: 53687091200
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 50542682624
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [4683] [i] Listening at "http://127.0.0.1:7844"
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [4683] [i] Listening at "http://[::1]:7844"
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [4684] [i] Worker 4684 started


Nov 15 09:34:53 openqaworker1 worker[3294]: GLOB(0x55d163c9d268)[info] got job 796621: 00796621-opensuse-Staging:J-Staging-DVD-x86_64-Build1426.2-gnome@64bit
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] +++ setup notes +++
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] start time: 2018-11-15 08:34:53
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] running on openqaworker1:14 (Linux 4.12.14-lp150.12.25-default #1 SMP Thu Nov 1 06:14:23 UTC 2018 (3fcf457) x86_64)
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 98.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 98.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied

Related issues 3 (1 open2 closed)

Copied to openQA Project (public) - action #43928: error about "permission denied" in openqa-workercache service logs trying to call "find" on "lost+found" directory of cacheResolvedokurz2018-11-15

Actions
Copied to openQA Project (public) - action #43931: Message about deprecated calls "Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error"Resolvedkraih2018-11-15

Actions
Copied to openQA Project (public) - coordination #43934: [epic] Manage o3 infrastructure with salt againBlockedokurz2021-03-16

Actions
Actions #1

Updated by okurz about 6 years ago

Trying to update the systemd services on all workers but salt seems to be not reachable except for power8.

AI: Bring back salt on workers -> #43934

$ for i in power8 aarch64 openqaworker1 openqaworker4 ; do ssh root@$i "systemctl enable --now openqa-worker-cacheservice && systemctl enable --now openqa-worker-cacheservice-minion" ; done
Failed to execute operation: No such file or directory
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice.service → /usr/lib/systemd/system/openqa-worker-cacheservice.service.
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice-minion.service → /usr/lib/systemd/system/openqa-worker-cacheservice-minion.service.
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice.service → /usr/lib/systemd/system/openqa-worker-cacheservice.service.
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice-minion.service → /usr/lib/systemd/system/openqa-worker-cacheservice-minion.service.
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice.service → /usr/lib/systemd/system/openqa-worker-cacheservice.service.
Created symlink /etc/systemd/system/multi-user.target.wants/openqa-worker-cacheservice-minion.service → /usr/lib/systemd/system/openqa-worker-cacheservice-minion.service.

power8 was not updated since September 10, probably not upgraded by sysrich. Doing the update of packages manually now even though it is not exactly the same package of openQA-worker. The most recent one is now 4.6.1542260661.49f37fa5-882.1 which includes a (trivial) change "Minor fixes for WritingTests doc (#1865)":

zypper dup
systemctl enable --now openqa-worker-cacheservice && systemctl enable --now openqa-worker-cacheservice-minion

AI: Align power8 to other workers -> #43937

https://openqa.opensuse.org/tests/796751/file/autoinst-log.txt "Can't download openSUSE-Tumbleweed-NET-x86_64-Snapshot20181114-Media.iso to /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-NET-x86_64-Snapshot20181114-Media.iso" but it has been processed by the cache service, what now?

Or a different example on openqaworker4 in https://openqa.opensuse.org/tests/796792/file/autoinst-log.txt :

[2018-11-15T09:49:44.0743 CET] [debug] Downloading opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2 - request sent to Cache Service.
[2018-11-15T09:59:47.0405 CET] [debug] Download of opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2 processed
[2018-11-15T09:59:47.0414 CET] [debug] [INFO] CACHE: Purging non registered /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging-DVD-x86_64.qcow2
[DEBUG] CACHE: removed /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Staging-DVD-x86_64.qcow2
[DEBUG] CACHE: Health: Real size: 50963469312, Configured limit: 53687091200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 50963469312
[INFO] Downloading opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2 from http://openqa1-opensuse/tests/796792/asset/hdd/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2
[DEBUG] CACHE: Downloading /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2: 
[DEBUG] CACHE: Downloading /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2: 
…
[DEBUG] CACHE: Downloading /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2: 
[DEBUG] CACHE: Health: Real size: 48718468096, Configured limit: 53687091200
[INFO] CACHE: updating the /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2 with "ffb51200-5646c03e0b445" and 4290056704
[DEBUG] CACHE: Asset download successful to /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2, Cache size is: 53008524800

[2018-11-15T09:59:47.0416 CET] [warn] job is missing files, releasing job
[2018-11-15T09:59:47.0455 CET] [info] +++ worker notes +++
[2018-11-15T09:59:47.0455 CET] [info] end time: 2018-11-15 08:59:47
[2018-11-15T09:59:47.0456 CET] [info] result: setup failure: Can't download opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2 to /var/lib/openqa/cache/openqa1-opensuse/opensuse-42.2-x86_64-Updates-20180205-1-kde@64bit.qcow2

so first the cache service downloads but then it's gone.

coolo suggested to remove the cache on all workers first.

So I stopped all worker service instances on all workers, removed the content from openqaworker4:/var/lib/openqa/cache/ and started the services on openqaworker4. Some jobs started but are not showing screenshots nor a live view as of now. Some errors related to cache:

# tail -f /var/lib/openqa/pool/15/autoinst-log.txt 
[INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
[ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
[DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-13.2-x86_64.qcow2
[INFO] CACHE: Purging non registered /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
[ERROR] CACHE: Could not remove /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
[DEBUG] CACHE: removed /var/lib/openqa/cache/old/openSUSE-Tumbleweed-KDE-Live-x86_64-Snapshot20181113-Media.iso
[DEBUG] CACHE: Health: Real size: 0, Configured limit: 53687091200
[INFO] CACHE: updating the /var/lib/openqa/cache/openqa1-opensuse/Argon.x86_64-15.0-Build4.150.iso with "39c00000-57a9e8ef3a40b" and 968884224
[DEBUG] CACHE: Asset download successful to /var/lib/openqa/cache/openqa1-opensuse/Argon.x86_64-15.0-Build4.150.iso, Cache size is: 968884224

AI: Report on individual error entries -> #43940

After the jobs on openqaworker4 seem to be fine on all other workers I deleted the cache, started the cache services and on worker instance:

for i in power8 aarch64 imagetester openqaworker1 ; do echo $i && ssh root@$i "rm -rf /var/lib/openqa/cache/* ; systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion openqa-worker@1"; done

And then

for i in power8 aarch64 imagetester openqaworker1 ; do echo $i && ssh root@$i "systemctl start openqa-worker.target"; done

Checked with

for i in power8 aarch64 imagetester openqaworker1 openqaworker4 ; do echo $i && ssh root@$i "systemctl is-active openqa-worker@*"; done

aarch64 seems to have a problem:

Nov 15 10:28:41 openqa-aarch64 worker[5209]: cannot open lock file: Permission denied

I can reproduce the same problem with

sudo -u _openqa-worker strace -o worker.strace -f /usr/share/openqa/script/worker --instance 1

even with apparmor.service disabled. But I think apparmor is still "in effect" though. What I saw in systemctl status apparmor.service is

Nov 15 03:32:59 openqa-aarch64 apparmor.systemd[1108]: Warning failed to create cache: usr.share.openqa.script.worker
Nov 15 03:32:59 openqa-aarch64 apparmor.systemd[1108]: Skipped: /etc/apparmor.d/usr.share.openqa.script.worker.rpmsave
Nov 15 03:32:59 openqa-aarch64 AppArmor(init)[1631]: Skipping profile /etc/apparmor.d/usr.share.openqa.script.worker.rpmsave

so I did mv usr.share.openqa.script.worker.rpmsave usr.share.openqa.script.worker and enabled apparmor again. Jobs run fine now.

Actions #2

Updated by okurz about 6 years ago

  • Status changed from In Progress to Feedback

Seems all is good now. I will need to provide specific action items based on the above findings.

Actions #3

Updated by okurz about 6 years ago

  • Copied to action #43928: error about "permission denied" in openqa-workercache service logs trying to call "find" on "lost+found" directory of cache added
Actions #4

Updated by okurz about 6 years ago

  • Copied to action #43931: Message about deprecated calls "Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error" added
Actions #5

Updated by okurz about 6 years ago

Actions #6

Updated by okurz about 6 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

I will need to provide specific action items based on the above findings.

done

Actions

Also available in: Atom PDF