Project

General

Profile

Actions

action #43823

closed

o3 workers immediately incompleting all jobs, caching service can not be reached

Added by okurz over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
Start date:
2018-11-15
Due date:
% Done:

0%

Estimated time:

Description

Observation

[15/11/2018 08:37:44] <fvogt> openQA is down: "result: setup failure: Cache service not available.". Just a usual Thursday?
[15/11/2018 08:38:15] <DimStar> already 4 hours ago - so a bit too early for a 'usual' thursday
…
[15/11/2018 09:14:22] <DimStar> okurz: who do I kick to fix o3/caching? :)
…
[15/11/2018 09:23:03] <okurz> what is the systemd service openqa-worker@multi-user?
[15/11/2018 09:28:14] <okurz> DimStar, sysrich: I suspect tonights upgrade of the o3 openQA workers broke it. I will try to find out the diff in the versions and see if I can revert something

Further investigation

From openqaworker1:/var/log/zypper.log:

2018-11-15 00:55:26 <1> openqaworker1(5900) [zypper++] Summary.cc(readPool):135 <uninstall> I_TsU(43455)openQA-worker-4.6.1542112053.3a061720-lp150.870.1.noar
ch(@System)
2018-11-15 00:55:41 <1> openqaworker1(5900) [zypp] PackageProvider.cc(providePackage):412 provide Package (38)openQA-worker-4.6.1542208789.64bcff24-lp150.880.1.noarch(devel_openQA)

-> package openQA-worker got updated from 4.6.1542112053.3a061720-lp150.870.1 to 4.6.1542208789.64bcff24-lp150.880.1 which has the following git commits included:

64bcff24 Add 2 new dependencies to the travis container: (#1864)
3a07cfa6 docs: Extend definition of *softfailed* to "known issues" (#1837)
7d4d9727 Refuse updating job group name with empty or blank (#1844)
fcf42650 Merge pull request #1852 from Martchus/show_only_latest_on_overview
bb46c55c Apparmor: Add /dev/hugepages (#1857)
7c800f79 Fix broken in section of inter machine dependencies (#1859)
9da05482 Merge pull request #1783 from mudler/separate_minion_cache_service
772ee569 Wrap up and polish
4e04c40e Do not use wantarray in GRU's enqueue()
b6354356 Update POD OpenQA::Worker::Cache::Client's SYNOPSIS
6d0f1b56 Avoid to race to get available minion workers
968ed4c9 Drop PID from log
8c7989c4 Move Minion workers under a single service
20cd5a5e Cleanup
5120b9df Fix typo
ea55d577 Adapt gru tests
84c508f2 Generate same lock for asset/host in the request object
b95c4105 Fix rsync output in minion task
b10f6265 Refactor and update Documentation
8f94fbcf Use the cache service to perform rsync tests caching
eb302f01 Create Minion task for rsync execution
aed9eda6 Use the host folder to cache tests/needles
4709b255 Do not check if child is running if it's not there
817b8669 Reduce the lock expiration time
262fe2cf Trim spaces from left and right in assets name
f2cefd87 Hold the db handle during transactions
73b059b5 Add test for Minion task registration and execution
c6bb9043 specfile: DBD-SQLite needs to be > 1.51, required by Minion::Backend::SQLite
b6c814c6 Add tests and cover also subprocess execution
a98f78fb Add function to generate session token
b559f9eb Read and set limit from config file, defaults to 50GB
88a224b1 Makefile: install Worker Cache Service systemd unit
f5268886 Add POD for Cache Service and extend docs/Installing.asciidoc
353c0bf5 Add service file and adapt openQA.spec
59e725c8 Avoid useless quotes on hash key
0857b29d Debug - adapt compile-check-all
976830b3 Address to PR comments - also add test for checking output in cache-service.t
52eee59c Drop unused function
910991c2 Write cache output into os-autoinst log
f86e165f Make possible to retrieve cache output
a0eb5c56 Add debug messages while enqueuing download
0b88e3f6 Asset path needs to be a string for vars
ff4f68af Make dequeue and status post
4eeef27f Adapt fullstack to cache path change
2c28bcf0 Extract domain parsing to _base_url and use it also in the client to resolve asset path
6104562c Adapt tests
09cb313c Get the cache path right
806560ec Incomplete job if cache service is down
3e62a820 tests: Wait also for worker minion in fullstack
3248c99c Use defaults in isotovideo engine
e6d6a321 Update defaults and remove protoyped function from Cache::Client
3ea979ae tests: Add test for available_workers()
aa0926f1 Add available_workers() to Cache::Client
ce3a5011 Fullstack: move sqlite cache check up
a280d126 Cache: make retry() return always the response object
4d2d04d4 Fullstack: Wait cache to be ready
f4d6e2dd Adapt tests and utils
378f61ea Drop unneeded default
69978a7f Use OpenQA::Worker::Client inside the cache task
dfaeca5b Start to adapt fullstack test
85a4a839 Adapt to PBP
7a52f2a1 Add dep on Minion::Backend::SQLite
ce0a9905 Move common testing functions to Test::Utils as they will be reused also in fullstack test
9393d5e0 Add Mojo::SQLite dependency
f1a1c59e Use the Cache service within the worker
b48c9196 Add accessors to retrieve asset path and check if exists
11605149 Remove unneded log line and cleanup check_limits()
85bf029d Finishing porting the Cache.pm code to Mojo::SQLite
b7f63eae Rework cleanup strategy, avoid endless loops - update etag fails now
79adbdd2 Use Minion Admin plugin with the cache service
bd72dca4 Drop downloading, Cache.pm now records just etag and size
84f12547 Use Devel::Cover in cache-service tests to enhance coverage
54ce7e5c Adapt cache tests - still fails on size calculation
0662f43c Move Cache.pm code to Mojo::SQLite
8d01039d Add dequeue() and tests for it
d9caf916 Fix and test enqueued tasks
2939bb24 Do not use list_jobs from Minion to gather statuses, causes races
fcd74a0d Add default usage tests and use minion lock to detect if it's in use
eabb02cf Fix test copyright header
14551a8e Add standard client usage
204375b5 Add Cache service client
bbceb9d1 Do not lock anymore on assets
688ca9f0 Add test for cache service
5acaea86 Add script to start cache service
ebb1307a Create separate class for Minion Server and worker
1bf1a1bc (Martchus/show_only_latest_on_overview) Prevent filter on test overview to reveal old jobs
10136589 Test that filtering overview does not reveal old jobs

I suspect https://github.com/os-autoinst/openQA/pull/1783 as it is related to "cache".

On openqaworker1 I started the systemd services openqa-worker-cacheservice-minion and openqa-worker-cacheservice manually and triggered a job https://openqa.opensuse.org/tests/796621#live which seems to run for now. I wonder if the services should be automatically started by package upgrades? But I guess not really because caching so far was just a "config option".

However I have the following observations:

# journalctl -f
-- Logs begin at Thu 2018-11-15 03:32:45 CET. --
Nov 15 09:34:33 openqaworker1 systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [DEBUG] CACHE: Health: Real size: 52787683840, Configured limit: 53687091200
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52787683840
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [DEBUG] CACHE: Health: Real size: 50542682624, Configured limit: 53687091200
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 50542682624
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [4683] [i] Listening at "http://127.0.0.1:7844"
Nov 15 09:34:33 openqaworker1 openqa-workercache[4683]: [4683] [i] Listening at "http://[::1]:7844"
Nov 15 09:34:33 openqaworker1 openqa-workercache[4684]: [4684] [i] Worker 4684 started


Nov 15 09:34:53 openqaworker1 worker[3294]: GLOB(0x55d163c9d268)[info] got job 796621: 00796621-opensuse-Staging:J-Staging-DVD-x86_64-Build1426.2-gnome@64bit
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] +++ setup notes +++
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] start time: 2018-11-15 08:34:53
Nov 15 09:34:53 openqaworker1 worker[3294]: [info] running on openqaworker1:14 (Linux 4.12.14-lp150.12.25-default #1 SMP Thu Nov 1 06:14:23 UTC 2018 (3fcf457) x86_64)
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 98.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 98.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 worker[3294]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error at /usr/share/openqa/script/../lib/OpenQA/Worker/Cache/Client.pm line 70.
Nov 15 09:34:53 openqaworker1 openqa-workercache[4684]: find: ‘/var/lib/openqa/cache/lost+found’: Permission denied

Related issues 3 (1 open2 closed)

Copied to openQA Project - action #43928: error about "permission denied" in openqa-workercache service logs trying to call "find" on "lost+found" directory of cacheResolvedokurz2018-11-15

Actions
Copied to openQA Project - action #43931: Message about deprecated calls "Mojo::Transaction::success is DEPRECATED in favor of Mojo::Transaction::result and Mojo::Transaction::error"Resolvedkraih2018-11-15

Actions
Copied to openQA Project - coordination #43934: [epic] Manage o3 infrastructure with salt againBlockedokurz2021-03-16

Actions
Actions

Also available in: Atom PDF