action #57539: imagetester is incompleting all jobs with existing but empty logs, as /var/lib/openqa/pool is full - openQA Infrastructure (public) - openSUSE Project Management Tool

action #57539

closed

imagetester is incompleting all jobs with existing but empty logs, as /var/lib/openqa/pool is full

Added by okurz over 5 years ago. Updated over 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

okurz

Target version:

openQA Project (public) - Done

Start date:

2019-09-30

Due date:

2019-10-08

Description

[30/09/2019 14:56:03] <fvogt> o3 looks very broken - all tests incomplete
[30/09/2019 14:57:50] <DimStar> fvogt: seems to be mostly imagetester?
[30/09/2019 14:58:36] <fvogt> Indeed
[30/09/2019 14:58:39] <fvogt> openqaworker1 works
[30/09/2019 15:22:13] <okurz> https://openqa.opensuse.org/tests/1044219 is incomplete on openqaworker1, so not *that* good
[30/09/2019 15:22:39] <fvogt> okurz: At least it has non-empty logs#
[30/09/2019 15:24:23] <okurz> yes, I disabled all openqa workers on imagetester for now as / is full and will restart all incomplete jobs from this machine

Updated by okurz over 5 years ago

Subject changed from imagetester is incompleting all jobs with existing but empty logs, as / is full to imagetester is incompleting all jobs with existing but empty logs, as /var/lib/openqa/pool is full
Priority changed from Immediate to Normal

Restarted all incomplete jobs from imagetester with:

host=openqa.opensuse.org; worker=imagetester; failed_since=2019-09-30
query="select id from jobs where assigned_worker_id in (select id from workers where host='$worker') and result='incomplete' and t_finished >= '$failed_since';"
for i in $(ssh "$host" "sudo -u geekotest psql --no-align --tuples-only --command=\"$query\" openqa"); do
    openqa-client --host "$host" jobs/$i/restart post
done

To check when this started, I looked through journalctl -u openqa-worker@* and found the first "Result: died" coming from the worker with PID 2110, started around "Sep 30 03:34:24". The corresponding log, https://openqa.opensuse.org/tests/1044019/file/autoinst-log.txt, shows:

[2019-09-30T05:06:02.103 CEST] [debug] /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/tests/x11/sshxterm.pm:43 called testapi::type_string
[2019-09-30T05:06:02.103 CEST] [debug] <<< testapi::type_string(string='killall xterm
', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-09-30T05:06:02.449 CEST] [debug] <<< testapi::assert_screen(mustmatch='generic-desktop', timeout=30)
libpng error: Write Error
[2019-09-30T05:06:03.828 CEST] [debug] >>> testapi::_handle_found_needle: found generic-desktop-kde-plasma512-leap15.1-aarch64-20190409, similarity 1.00 @ 2/733
[2019-09-30T05:06:03.830 CEST] [debug] ||| finished sshxterm x11 at 2019-09-30 03:06:03 (45 s)
Can't close(GLOB(0x5617dca65888)) filehandle: 'No space left on device' at /usr/lib/os-autoinst/bmwqemu.pm line 322
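
The journal search described above could be sketched roughly as follows. The helper name first_died and the --since window are illustrative assumptions, not taken from the ticket; only the unit pattern and the "Result: died" string come from the text above.

```shell
# Hypothetical helper: print the first line on stdin that reports a died job.
first_died() {
    grep -m1 'Result: died'
}

# Assumed usage on the worker host (the time window is an illustrative choice):
#   journalctl -u 'openqa-worker@*' --since 2019-09-30 | first_died
```

Reading from stdin keeps the filter independent of the journal, so the same helper can be pointed at a saved log file.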

The pool directory on imagetester is a tmpfs of only 64GB. With current tests often using 40GB each, we cannot reliably sustain more than one instance. We would have more room on /dev/sda:

/dev/sda1           3.6T   54G  3.4T   2% /var/lib/openqa/cache
tmpfs                64G   32K   64G   1% /var/lib/openqa/pool
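
As a rough check of the numbers above, the following sketch computes how many jobs fit in the pool if every job peaked at the same time. The function name max_instances is hypothetical, and the assumption that all jobs reach the 40GB figure simultaneously is a worst case, not something the ticket measured.

```shell
# Hypothetical helper: worst-case number of concurrent jobs that fit in the
# pool, using integer division of pool size by peak per-job usage (both in GB).
max_instances() {
    pool_gb=$1
    per_job_gb=$2
    echo $((pool_gb / per_job_gb))
}

max_instances 64 40   # prints 1
```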

I am not aware of recent changes regarding this.

Updated by okurz over 5 years ago

I have asked on #opensuse-factory and openqa-dev (RC) if anyone knows of recent changes involving the 64GB tmpfs pool dir.

Updated by okurz over 5 years ago

Due date set to 2019-10-08
Status changed from In Progress to Feedback
Priority changed from Normal to High

We should keep this on High with a due date, so that we decide what to do while the workers are completely disabled. I masked the worker target and the worker instances:

systemctl mask --now openqa-worker.target openqa-worker@{1..5}

Waiting until next week, 2019-10-08, to see if anyone else comes back with a good idea of what happened.

EDIT: As decided in the QA tools meeting on 2019-10-01, we will reduce the worker instances to a number that should be safe. I decided on two instances.
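
The ticket does not record the commands used for this reduction. One way it might look, assuming the five instances masked above and that instances 1 and 2 are the ones kept, is sketched below as a dry run. The helper plan_worker_instances is hypothetical and only prints the systemctl calls so they can be reviewed before being run.

```shell
# Hypothetical dry-run helper: print (do not execute) the systemctl calls that
# would leave only the first $1 of $2 openqa-worker instances active.
plan_worker_instances() {
    keep=$1
    total=$2
    echo "systemctl unmask openqa-worker.target"
    i=1
    while [ "$i" -le "$total" ]; do
        if [ "$i" -le "$keep" ]; then
            # Instances to keep: undo the earlier mask, then start them.
            echo "systemctl unmask openqa-worker@$i"
            echo "systemctl enable --now openqa-worker@$i"
        else
            # Remaining instances stay masked so they cannot be started.
            echo "systemctl mask --now openqa-worker@$i"
        fi
        i=$((i + 1))
    done
}

plan_worker_instances 2 5
```

Printing the plan first avoids touching a production worker by accident. The output can be piped to sh once it has been checked.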

Updated by okurz over 5 years ago

Status changed from Feedback to Resolved
Target version set to Done
