action #62243
closed — After latest updates, openQA has problematic behavior on Dell Precision 5810
Added by syrianidou_sofia almost 5 years ago. Updated almost 5 years ago.
Description
After updating my workstation, openQA started having trouble running any job. Jobs remain scheduled, even though sometimes they run. Sometimes a job gets terminated without any error messages or logs. Sometimes I get the error "os-autoinst command server not available, job is likely not running" even though the job runs. Video and live view are never available. Initially, I thought this was a problem with Tumbleweed, so I formatted the workstation and installed Leap. I see the exact same behavior on Leap as well, and also in the containerized version.
Link to openQA in the container:
http://falafel.suse.cz/
Files
os-autoinst_job1.txt (4.51 KB) | Logs from one of the first fails, on Tumbleweed | syrianidou_sofia, 2020-01-17 13:15
pool_folder1.tar.gz (432 KB) | test failing in container openQA | syrianidou_sofia, 2020-01-17 13:15
pool_folder2.tar.gz (171 KB) | another test failing in container | syrianidou_sofia, 2020-01-17 13:16
Updated by syrianidou_sofia almost 5 years ago
- File os-autoinst_job1.txt os-autoinst_job1.txt added
- File pool_folder1.tar.gz pool_folder1.tar.gz added
- File pool_folder2.tar.gz pool_folder2.tar.gz added
Updated by okurz almost 5 years ago
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category set to Support
- Priority changed from Normal to High
Updated by okurz almost 5 years ago
- Status changed from New to Feedback
- Assignee set to okurz
- Target version set to Current Sprint
What I see from
[2020-01-17T13:49:12.0096 CET] [error] REST-API error (POST http://localhost/api/v1/jobs/18/status): 400 response: Got status update for job 18 with unexpected worker ID 1 (expected 2, job is scheduled) (remaining tries: 2)
[2020-01-17T13:49:17.0112 CET] [error] REST-API error (POST http://localhost/api/v1/jobs/18/status): 400 response: Got status update for job 18 and worker 1 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 1)
[2020-01-17T13:49:22.0132 CET] [error] REST-API error (POST http://localhost/api/v1/jobs/18/status): 400 response: Got status update for job 18 and worker 1 but there is not even a worker assigned to this job (job is scheduled) (remaining tries: 0)
[2020-01-17T13:49:22.0132 CET] [error] Aborting job because web UI doesn't accept new images anymore (likely considers this job dead)
looks suspiciously like #62015
@syrianidou_sofia if you have a backup or snapshot of older packages I suggest rolling back to the previous versions if they did not cause any problems for you. You can also try the version that comes with Leap's official repositories, but that version might be too old to be compatible with current SLE tests, which most likely need a more recent version.
There is no clear workaround, but you can try to experiment with fewer or more workers in parallel, or with workers on different machines.
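For reference, openQA worker instances are systemd template units, so the number of parallel workers can be adjusted along these lines (a sketch; the instance numbers are only examples):

```shell
# Reduce to two parallel worker instances by stopping and disabling the extras:
sudo systemctl disable --now openqa-worker@{3..5}

# Later, scale up again by enabling more instances:
sudo systemctl enable --now openqa-worker@{3..5}
```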
Updated by okurz almost 5 years ago
- Related to action #62015: jobs incomplete without logs as some workers are rejected (was: Scheduler does not work) added
Updated by syrianidou_sofia almost 5 years ago
Unfortunately, I don't have a snapshot, as I formatted my workstation in my attempt to fix the issue. I have been trying for 3 days or more to find a workaround, but I am not familiar with openQA. Does the following error message tell you anything?
Use of uninitialized value in split at /var/lib/openqa/share/tests/sle/lib/bootloader_setup.pm line 312.
Updated by okurz almost 5 years ago
syrianidou_sofia wrote:
Unfortunately, I don't have a snapshot, as I formatted my workstation in my attempt to fix the issue. I have been trying for 3 days or more to find a workaround, but I am not familiar with openQA. Does the following error message tell you anything?
Use of uninitialized value in split at /var/lib/openqa/share/tests/sle/lib/bootloader_setup.pm line 312.
Well, it means that https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/bootloader_setup.pm#L312 is bad test code and should be fixed, but I doubt this is related to your problem. As I prefer to have these warnings fixed so that they do not distract from other problems – as happened here as well – I prepared a fix in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/9402
I guess local colleagues with experience in openQA could be a more efficient help, but I also offer my help for a more interactive session, e.g. you can ping me over chat and/or provide me login details so that I can at least take a look at the machine. If you like, I can also try to fix it.
Updated by mkittler almost 5 years ago
It really looks similar to #62015, although in your case it is always happening, right? And considering the ticket descriptions, in your case there might be some additional problems. To investigate the problem it makes sense to look not only into the openqa-worker@ log but also into the openqa-webui, openqa-websockets and openqa-scheduler logs (so attach those files similar to #62015).
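Assuming a systemd-based installation, those logs can be pulled from the journal roughly like this (a sketch; unit names as used by the openQA packages, output file names are arbitrary):

```shell
# Web-UI-side services since the last boot:
journalctl -b -u openqa-webui -u openqa-websockets -u openqa-scheduler > webui-side.log

# All worker instances (glob matching in -u needs a reasonably recent systemd):
journalctl -b -u 'openqa-worker@*' > workers.log
```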
Updated by okurz almost 5 years ago
@mkittler thank you for also looking into this issue. I also suspect #62015.
Sofia and I got in contact over chat.
One issue is that openQA services "do not start by themselves, even though they are enabled". Another observation is that "usually after rebooting the machine, the first job I send to openQA works. The job is sent, remaining in the schedule, and sometimes it runs (usually failing at some point); other times it just waits in the schedule".
I logged into the machine and at the time found that the system was executing tests: no failed systemd services, enough computing capacity. That was after a fresh reboot. What I see, though, is that the machine has an HDD but 5 worker instances enabled. This might be a problem. I would assume that one worker works fine, maybe two; I would not guarantee more than 2 VMs in parallel without an SSD.
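The sizing rule of thumb above can be sketched as follows (my own illustration, not an official openQA guideline; the device name in the comment is an assumption):

```shell
# Rule of thumb from the comment above: at most 2 parallel worker
# instances on a rotational disk (HDD); more is fine on an SSD.
# In real use, read the flag from sysfs, e.g.:
#   rotational=$(cat /sys/block/sda/queue/rotational)
rotational=1   # hardcoded for illustration: 1 = HDD, 0 = SSD
if [ "$rotational" -eq 1 ]; then
  workers=2
else
  workers=5
fi
echo "suggested worker instances: $workers"
```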
We could not define a good range of potentially affected package versions. The last working update was "a month ago"; "the problem started after I updated two weeks ago", so "first bad" is not older than any package created after 2020-01-13, I guess. In /var/log/zypp/history I find "2020-01-15 11:57:02|install|openQA-bootstrap|4.6.1555478309.f6dfdd9b-lp151.1.7" as the earliest entry, so I guess any changes after that haven't introduced the problem; the faulty version, if it's a regression in openQA, must have been installed before that.
I looked into the system journal with journalctl -b
to show the complete system log starting with the most recent boot, and I see a lot of errors on startup, especially from the openQA workers. It looks like the openQA workers start very early, before the network is fully up, which takes longer.
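To separate such startup errors from the rest of the boot noise, the journal can also be filtered by priority, e.g.:

```shell
# Only error-level (and worse) messages from the current boot:
journalctl -b -p err
```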
package repos including priorities look fine:
# zypper lr --details
# | Alias | Name | Enabled | GPG Check | Refresh | Priority | Type | URI | Service
---+---------------------------+------------------------------------+---------+-----------+---------+----------+--------+-----------------------------------------------------------------------------------------+--------
1 | SUSE:CA | SUSE:CA | Yes | (r ) Yes | No | 99 | rpm-md | http://download.suse.de/ibs/SUSE:/CA/openSUSE_Leap_15.1 |
2 | devel:openQA | devel:openQA | Yes | (r ) Yes | No | 90 | rpm-md | https://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.1 |
3 | devel:openQA:Leap:15.1 | devel:openQA:Leap:15.1 | Yes | (r ) Yes | No | 91 | rpm-md | https://download.opensuse.org/repositories/devel:/openQA:/Leap:/15.1/openSUSE_Leap_15.1 |
8 | repo-non-oss | Non-OSS Repository | Yes | (r ) Yes | Yes | 99 | rpm-md | http://download.opensuse.org/distribution/leap/15.1/repo/non-oss/ |
9 | repo-oss | Main Repository | Yes | (r ) Yes | Yes | 99 | rpm-md | http://download.opensuse.org/distribution/leap/15.1/repo/oss/ |
12 | repo-update | Main Update Repository | Yes | (r ) Yes | Yes | 99 | rpm-md | http://download.opensuse.org/update/leap/15.1/oss |
13 | repo-update-non-oss | Update Repository (Non-Oss) | Yes | (r ) Yes | Yes | 99 | rpm-md | http://download.opensuse.org/update/leap/15.1/non-oss/ |
Still, I am upgrading right now so we know which package versions we are testing with.
# rpm -qa | grep '\(openQA\|os-autoinst\)'
…
os-autoinst-4.6.1579522398.2afcda12-lp151.320.1.x86_64
openQA-4.6.1579718395.75fc711c3-lp151.2189.1.noarch
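For completeness, such an upgrade can be done against the devel repo listed earlier, e.g. (a sketch using the repo alias from the zypper lr output above):

```shell
# Refresh metadata and switch the openQA/os-autoinst packages to the devel repo:
sudo zypper refresh
sudo zypper dup --from devel:openQA
```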
Two worker instances activated, openqa-worker@{1..2}
One job starts fine: http://falafel.suse.cz/tests/40
Stopped again (in the middle), and stopped the services with systemctl stop openqa-worker@{1..2} openqa-webui openqa-scheduler openqa-websockets openqa-livehandler openqa-gru.
When doing this locally I also have problems:
Jan 22 20:05:57 linux-28d6.suse worker[3061]: [info] 3290: WORKING 26
Jan 22 20:05:58 linux-28d6.suse worker[3061]: [info] Isotovideo exit status: 1
Jan 22 20:05:58 linux-28d6.suse openqa[3053]: [info] Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running)
Jan 22 20:05:58 linux-28d6.suse worker[3061]: [error] REST-API error (POST http://localhost/api/v1/jobs/26/status): 400 response: Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running) (remaining tries: 2)
Jan 22 20:06:03 linux-28d6.suse openqa[3053]: [info] Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running)
Jan 22 20:06:03 linux-28d6.suse worker[3061]: [error] REST-API error (POST http://localhost/api/v1/jobs/26/status): 400 response: Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running) (remaining tries: 1)
Jan 22 20:06:08 linux-28d6.suse openqa[3053]: [info] Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running)
Jan 22 20:06:08 linux-28d6.suse worker[3061]: [error] REST-API error (POST http://localhost/api/v1/jobs/26/status): 400 response: Got status update for job 26 and worker 2 but there is not even a worker assigned to this job (job is running) (remaining tries: 0)
Jan 22 20:06:08 linux-28d6.suse worker[3061]: [info] +++ worker notes +++
Jan 22 20:06:08 linux-28d6.suse worker[3061]: [info] End time: 2020-01-22 19:06:08
Jan 22 20:06:08 linux-28d6.suse worker[3061]: [info] Result: died
but this is a version from 2020-01-15, so I am updating to ensure I have the fix for #62015
EDIT: After the upgrade, tests also start fine for me. Back on falafel.suse.cz the current state using the new packages also seems fine. The first two jobs were running fine in parallel, and then I re-enabled openqa-worker@{3..5} and triggered tests on those as well. Stopping the complete stack segfaults isotovideo, but besides that the cleanup is fine:
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping openQA Worker #1...
Jan 22 21:38:46 falafel.suse.cz worker[8780]: [info] Quit due to signal TERM
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping openQA Worker #2...
Jan 22 21:38:46 falafel.suse.cz worker[8779]: [info] Quit due to signal TERM
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping openQA Worker #3...
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping openQA Worker #4...
Jan 22 21:38:46 falafel.suse.cz worker[9773]: [info] Quit due to signal TERM
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping openQA Worker #5...
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping Handler for live view in openQA's web UI...
Jan 22 21:38:46 falafel.suse.cz openqa-livehandler[8768]: Server available at http://[::1]:9528
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles...
Jan 22 21:38:46 falafel.suse.cz worker[9774]: [info] Quit due to signal TERM
Jan 22 21:38:46 falafel.suse.cz worker[9776]: [info] Quit due to signal TERM
Jan 22 21:38:46 falafel.suse.cz systemd[1]: Stopped Handler for live view in openQA's web UI.
Jan 22 21:38:46 falafel.suse.cz worker[8779]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20023/USRrlGRlgulbiD2l/broadcast
Jan 22 21:38:46 falafel.suse.cz worker[8780]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20013/IcHJqR3OQlfPb3os/broadcast
Jan 22 21:38:46 falafel.suse.cz worker[9773]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20033/R1qTWseTod_krgEW/broadcast
Jan 22 21:38:46 falafel.suse.cz worker[9774]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20043/e8VAoCujJuPprASx/broadcast
Jan 22 21:38:46 falafel.suse.cz worker[9776]: [info] Trying to stop job gracefully by announcing it to command server via http://localhost:20053/XXFn2IwR_Y7GSPpD/broadcast
Jan 22 21:38:48 falafel.suse.cz kernel: /usr/bin/isotov[9711]: segfault at 514 ip 00005612103e76a0 sp 00007fe347c1b8f0 error 4 in perl[561210327000+1f9000]
Jan 22 21:38:48 falafel.suse.cz kernel: /usr/bin/isotov[9704]: segfault at 514 ip 0000558e989826a0 sp 00007fc1496fd8f0 error 4 in perl[558e988c2000+1f9000]
Jan 22 21:38:48 falafel.suse.cz kernel: /usr/bin/isotov[9936]: segfault at 514 ip 00005639ac6e36a0 sp 00007fcc5430c8f0 error 4 in perl[5639ac623000+1f9000]
Jan 22 21:38:48 falafel.suse.cz worker[9773]: [info] Isotovideo exit status: 0
Jan 22 21:38:48 falafel.suse.cz kernel: /usr/bin/isotov[9870]: segfault at 514 ip 0000557c5570a6a0 sp 00007fd4538358f0 error 4 in perl[557c5564a000+1f9000]
Jan 22 21:38:48 falafel.suse.cz worker[9776]: [info] Isotovideo exit status: 0
Jan 22 21:38:48 falafel.suse.cz kernel: /usr/bin/isotov[9833]: segfault at 514 ip 00005572569c26a0 sp 00007f97cacd18f0 error 4 in perl[557256902000+1f9000]
Jan 22 21:38:48 falafel.suse.cz worker[9774]: [info] Isotovideo exit status: 0
Jan 22 21:38:49 falafel.suse.cz openqa[8778]: [info] Worker 8778 stopped
Jan 22 21:38:49 falafel.suse.cz systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles.
Jan 22 21:38:49 falafel.suse.cz worker[8779]: [info] Isotovideo exit status: 1
Jan 22 21:38:49 falafel.suse.cz worker[8780]: [info] Isotovideo exit status: 1
Jan 22 21:38:54 falafel.suse.cz worker[8779]: [info] +++ worker notes +++
Jan 22 21:38:54 falafel.suse.cz worker[8779]: [info] End time: 2020-01-22 20:38:54
Jan 22 21:38:54 falafel.suse.cz worker[8779]: [info] Result: quit
Jan 22 21:38:54 falafel.suse.cz worker[8780]: [info] +++ worker notes +++
Jan 22 21:38:54 falafel.suse.cz worker[8780]: [info] End time: 2020-01-22 20:38:54
Jan 22 21:38:54 falafel.suse.cz worker[8780]: [info] Result: quit
…
Starting the stack again brings up all services and workers; after some seconds the workers (re-)connect and test jobs start up correctly. I triggered a reboot of the machine to check how the services behave during bootup.
After bootup I checked with systemctl --failed and found:
# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● openqa-livehandler.service loaded failed failed Handler for live view in openQA's web UI
● openqa-scheduler.service loaded failed failed The openQA Scheduler
● openqa-websockets.service loaded failed failed The openQA WebSockets server
● openqa-webui.service loaded failed failed The openQA web UI
with details
Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel openqa-websockets[1283]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:29 falafel systemd[1]: openqa-scheduler.service: Main process exited, code=exited, status=22/n/a
Jan 22 21:42:29 falafel systemd[1]: openqa-scheduler.service: Unit entered failed state.
Jan 22 21:42:29 falafel systemd[1]: openqa-scheduler.service: Failed with result 'exit-code'.
Jan 22 21:42:29 falafel systemd[1]: openqa-websockets.service: Main process exited, code=exited, status=22/n/a
Jan 22 21:42:29 falafel systemd[1]: openqa-websockets.service: Unit entered failed state.
Jan 22 21:42:29 falafel systemd[1]: openqa-websockets.service: Failed with result 'exit-code'.
Jan 22 21:42:31 falafel openqa-livehandler[1248]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:31 falafel systemd[1]: openqa-livehandler.service: Main process exited, code=exited, status=22/n/a
Jan 22 21:42:31 falafel systemd[1]: openqa-livehandler.service: Unit entered failed state.
Jan 22 21:42:31 falafel systemd[1]: openqa-livehandler.service: Failed with result 'exit-code'.
Jan 22 21:42:32 falafel.suse.cz openqa[1284]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
Jan 22 21:42:32 falafel.suse.cz openqa[1286]: [info] Resetting all leftover Gru locks after restart
Jan 22 21:42:32 falafel.suse.cz systemd[1]: openqa-webui.service: Main process exited, code=exited, status=22/n/a
Jan 22 21:42:32 falafel.suse.cz systemd[1]: openqa-webui.service: Unit entered failed state.
Jan 22 21:42:32 falafel.suse.cz systemd[1]: openqa-webui.service: Failed with result 'exit-code'.
These are the messages I had also seen on initial login earlier today. A restart of these services fixes the problem. On the machine NetworkManager is used, which is also the new default in newer Leap and Tumbleweed; at the time openQA was initially developed, wicked was the default for managing networks. My hypothesis is that there is a race condition between the startup of the openQA services and the network setup on bootup. The complete system journal has:
Jan 22 21:42:29 falafel openqa-scheduler[1282]: Can't create listen socket: Address family for hostname not supported at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 124.
…
Jan 22 21:42:30 falafel NetworkManager[903]: <info> [1579725750.2029] device (eth1): carrier: link connected
so the openQA services choke on a not-yet-ready network connection.
As a workaround, the systemd services can be made to wait for the network to be online, as described on https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ . I applied that workaround on the machine now:
# systemctl cat openqa-scheduler
# /usr/lib/systemd/system/openqa-scheduler.service
[Unit]
Description=The openQA Scheduler
After=postgresql.service openqa-setup-db.service
Wants=openqa-setup-db.service
[Service]
User=geekotest
ExecStart=/usr/share/openqa/script/openqa-scheduler daemon -m production
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/openqa-scheduler.service.d/override.conf
[Unit]
After=network-online.target
Wants=network-online.target
The same override was necessary in /etc/systemd/system/openqa-livehandler.service.d/override.conf. However, a proper fix needs to be done in either openQA or Mojo::IOLoop; I am creating a specific ticket for that: #62567
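The same drop-in can be created for each affected service without editing files by hand (a sketch; note that with NetworkManager, network-online.target only delays startup meaningfully if NetworkManager-wait-online.service is enabled):

```shell
# Opens an editor to create /etc/systemd/system/<unit>.service.d/override.conf
# for each listed unit; paste the two-line [Unit] override shown above:
sudo systemctl edit openqa-scheduler openqa-websockets openqa-webui openqa-livehandler

# Make network-online.target actually wait for NetworkManager:
sudo systemctl enable NetworkManager-wait-online.service
sudo systemctl daemon-reload
```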
Updated by okurz almost 5 years ago
- Copied to action #62567: openqa services can fail when network is not up (yet) "Can't create listen socket: Address family for hostname not supported" added
Updated by okurz almost 5 years ago
@Sofia IMHO I have fixed the issues for you for now. Summary of my changes:
- Upgraded the openQA packages again to include a fix for #62015
- Applied a workaround to the systemd services so they wait for the network to be online before starting
At the time of writing, 5 test jobs were still running in parallel. If you are happy with the current state, we can resolve the ticket. A proper fix for the network dependency is planned in #62567.
Updated by syrianidou_sofia almost 5 years ago
- Status changed from Feedback to In Progress
I just checked. Indeed it seems to be working now. We can close this ticket. Thanks a lot.
Updated by syrianidou_sofia almost 5 years ago
- Status changed from In Progress to Feedback
Updated by okurz almost 5 years ago
- Status changed from Feedback to Resolved
- Target version changed from Current Sprint to Done
Updated by mkittler almost 5 years ago
Good that the latest version doesn't show the job scheduling problems anymore. I remember that we had a discussion about the network availability earlier.