action #57539

imagetester is incompleting all jobs with existing but empty logs, as /var/lib/openqa/pool is full

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 30/09/2019
Priority: High
Due date: 08/10/2019
Assignee: okurz
% Done: 0%
Category: -
Target version: openQA Project - Done
Duration: 7

Description

[30/09/2019 14:56:03] <fvogt> o3 looks very broken - all tests incomplete
[30/09/2019 14:57:50] <DimStar> fvogt: seems to be mostly imagetester?
[30/09/2019 14:58:36] <fvogt> Indeed
[30/09/2019 14:58:39] <fvogt> openqaworker1 works
[30/09/2019 15:22:13] <okurz> https://openqa.opensuse.org/tests/1044219 is incomplete on openqaworker1, so not *that* good
[30/09/2019 15:22:39] <fvogt> okurz: At least it has non-empty logs#
[30/09/2019 15:24:23] <okurz> yes, I disabled all openqa workers on imagetester for now as / is full and will restart all incomplete jobs from this machine

History

#1 Updated by okurz 5 months ago

  • Subject changed from imagetester is incompleting all jobs with existing but empty logs, as / is full to imagetester is incompleting all jobs with existing but empty logs, as /var/lib/openqa/pool is full
  • Priority changed from Immediate to Normal
Command to restart all incomplete jobs that ran on imagetester:

host=openqa.opensuse.org; worker=imagetester; failed_since=2019-09-30
for i in $(ssh $host "sudo -u geekotest psql --no-align --tuples-only --command=\"select id from jobs where assigned_worker_id in (select id from workers where host='$worker') and result='incomplete' and t_finished >= '$failed_since';\" openqa"); do
    openqa-client --host $host jobs/$i/restart post
done

Checking when this might have started: looking with journalctl -u openqa-worker@* I found the first "Result: died" coming from the worker with PID 2110, started around "Sep 30 03:34:24". The corresponding log is https://openqa.opensuse.org/tests/1044019/file/autoinst-log.txt, showing:

[2019-09-30T05:06:02.103 CEST] [debug] /var/lib/openqa/cache/openqa1-opensuse/tests/opensuse/tests/x11/sshxterm.pm:43 called testapi::type_string
[2019-09-30T05:06:02.103 CEST] [debug] <<< testapi::type_string(string='killall xterm
', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-09-30T05:06:02.449 CEST] [debug] <<< testapi::assert_screen(mustmatch='generic-desktop', timeout=30)
libpng error: Write Error
[2019-09-30T05:06:03.828 CEST] [debug] >>> testapi::_handle_found_needle: found generic-desktop-kde-plasma512-leap15.1-aarch64-20190409, similarity 1.00 @ 2/733
[2019-09-30T05:06:03.830 CEST] [debug] ||| finished sshxterm x11 at 2019-09-30 03:06:03 (45 s)
Can't close(GLOB(0x5617dca65888)) filehandle: 'No space left on device' at /usr/lib/os-autoinst/bmwqemu.pm line 322
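The search for the first failure described above can be sketched as a small filter. This is only an illustration, not the exact command that was run: the helper name first_died is made up here, and the --since date is an assumption taken from the ticket's start date.

```shell
#!/bin/sh
# Hypothetical helper: print only the first line on stdin that
# contains "Result: died", then stop reading.
first_died() {
    grep -m1 'Result: died'
}

# Assumed usage on the worker host (not run here):
#   journalctl -u 'openqa-worker@*' --since '2019-09-30' --no-pager | first_died
```

Stopping at the first match keeps the output to the single line whose timestamp and PID identify where the failures began.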

The pool on imagetester is a tmpfs of only 64GB. With current tests often using 40GB each, it cannot sustain more than one instance at a time. We would have more room on /dev/sda:

/dev/sda1           3.6T   54G  3.4T   2% /var/lib/openqa/cache
tmpfs                64G   32K   64G   1% /var/lib/openqa/pool
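The capacity reasoning above can be sketched as a quick calculation. This is a conservative estimate that assumes every job reaches its peak pool usage at the same time; the function names max_instances and pool_avail_kb are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: how many jobs fit into the pool if each one
# needs per_job_gb at the same time (integer division, rounds down).
max_instances() {
    pool_gb=$1
    per_job_gb=$2
    echo $(( pool_gb / per_job_gb ))
}

# Hypothetical helper: available space in KiB on the filesystem that
# holds the given directory, using the POSIX output format of df.
pool_avail_kb() {
    df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# Numbers from this ticket: 64GB tmpfs, tests often using 40GB.
max_instances 64 40
```

On the worker itself, pool_avail_kb /var/lib/openqa/pool would report how much of the tmpfs is still free.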

I am not aware of recent changes regarding this.

#2 Updated by okurz 5 months ago

I have asked on #opensuse-factory and openqa-dev (RC) if anyone knows of recent changes involving the 64GB tmpfs pool dir.

#3 Updated by okurz 5 months ago

  • Due date set to 08/10/2019
  • Status changed from In Progress to Feedback
  • Priority changed from Normal to High

We should keep this on High with a due date, to see what we need to do while the workers are completely disabled. I masked the worker target and the worker instances:

systemctl mask --now openqa-worker.target openqa-worker@{1..5}

Waiting until next week, 2019-10-08, to see if anyone else comes back with a good idea of what happened.

EDIT: As decided in the QA tools meeting on 2019-10-01, we will reduce the number of worker instances to what should be safe. I decided on two instances.
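Bringing back only two instances could look roughly like the following dry run. The ticket does not record the commands actually used, so this is an assumption: the function names are hypothetical and the script only prints the systemctl calls instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: build the unit names openqa-worker@1 .. @N.
worker_units() {
    n=$1
    i=1
    units=""
    while [ "$i" -le "$n" ]; do
        units="$units openqa-worker@$i"
        i=$((i + 1))
    done
    echo "${units# }"
}

# Hypothetical helper: print (do not execute) the commands that would
# unmask and start the first N instances; higher instances stay masked.
reenable_commands() {
    units=$(worker_units "$1")
    echo "systemctl unmask openqa-worker.target $units"
    echo "systemctl enable --now $units"
}

reenable_commands 2
```

Printing the commands first lets them be reviewed before anything is changed on the worker host.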

#4 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved
  • Target version set to Done
