Project

General

Profile

Actions

action #6618

closed

The job modules aren't updated correctly

Added by coolo about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2015-03-10
Due date:
% Done:

0%

Estimated time:

Description

I'm still looking for a theory how this can happen, but the result of the modules aren't updated reliable:

id | job_id | name | script | category | t_created | t_updated | result | soft_failure | milestone | important | fatal
--------+--------+-----------------------+---------------------------------------------+--------------+---------------------+---------------------+---------+--------------+-----------+-----------+-------
510801 | 16276 | isosize | tests/installation/isosize.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:26:24 | passed | 0 | 0 | 1 | 0
510802 | 16276 | bootloader | tests/installation/bootloader.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:26:24 | passed | 0 | 0 | 0 | 1
510803 | 16276 | welcome | tests/installation/welcome.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:27:41 | passed | 0 | 0 | 0 | 1
510804 | 16276 | scc_registration | tests/installation/scc_registration.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:28:00 | passed | 0 | 0 | 0 | 1
510805 | 16276 | addon_products_sle | tests/installation/addon_products_sle.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:28:16 | passed | 0 | 0 | 0 | 1
510806 | 16276 | partitioning | tests/installation/partitioning.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:28:26 | passed | 0 | 0 | 0 | 1
510807 | 16276 | partitioning_finish | tests/installation/partitioning_finish.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:28:26 | passed | 0 | 0 | 0 | 1
510808 | 16276 | installer_timezone | tests/installation/installer_timezone.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:05 | passed | 0 | 0 | 0 | 1
510809 | 16276 | logpackages | tests/installation/logpackages.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:05 | passed | 0 | 0 | 0 | 1
510810 | 16276 | user_settings | tests/installation/user_settings.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:05 | running | 0 | 0 | 0 | 1
510811 | 16276 | user_settings_root | tests/installation/user_settings_root.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:25 | passed | 0 | 0 | 0 | 1
510812 | 16276 | installation_overview | tests/installation/installation_overview.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:35 | passed | 0 | 0 | 0 | 1
510813 | 16276 | start_install | tests/installation/start_install.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:29:48 | passed | 0 | 0 | 0 | 1
510814 | 16276 | livecdreboot | tests/installation/livecdreboot.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:38:56 | passed | 0 | 0 | 0 | 1
510815 | 16276 | first_boot | tests/installation/first_boot.pm | installation | 2015-03-10 11:25:25 | 2015-03-10 11:39:30 | passed | 0 | 1 | 1 | 1
510816 | 16276 | consoletest_setup | tests/console/consoletest_setup.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:41:40 | passed | 0 | 1 | 1 | 1
510817 | 16276 | textinfo | tests/console/textinfo.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:42:30 | passed | 0 | 0 | 0 | 0
510818 | 16276 | hostname | tests/console/hostname.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:42:56 | passed | 0 | 1 | 1 | 1
510819 | 16276 | xorg_vt | tests/console/xorg_vt.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:43:05 | passed | 0 | 0 | 1 | 0
510820 | 16276 | zypper_lr | tests/console/zypper_lr.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:44:05 | passed | 0 | 0 | 0 | 0
510821 | 16276 | zypper_ref | tests/console/zypper_ref.pm | console | 2015-03-10 11:25:25 | 2015-03-10 11:46:35 | passed | 0 | 1 | 1 | 0
510822 | 16276 | yast2_lan | tests/console/yast2_lan.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:48:41 | passed | 0 | 0 | 0 | 0
510823 | 16276 | aplay | tests/console/aplay.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:50:15 | passed | 0 | 0 | 0 | 0
510824 | 16276 | glibc_i686 | tests/console/glibc_i686.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:50:45 | passed | 0 | 0 | 0 | 0
510825 | 16276 | zypper_up | tests/console/zypper_up.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:52:56 | passed | 0 | 1 | 0 | 0
510826 | 16276 | zypper_in | tests/console/zypper_in.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:52:56 | running | 0 | 0 | 0 | 0
510827 | 16276 | yast2_i | tests/console/yast2_i.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:25:26 | none | 0 | 0 | 0 | 0
510828 | 16276 | yast2_bootloader | tests/console/yast2_bootloader.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:57:24 | passed | 0 | 0 | 0 | 0
510829 | 16276 | sshd | tests/console/sshd.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:59:51 | passed | 0 | 1 | 0 | 0
510830 | 16276 | mtab | tests/console/mtab.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:59:51 | running | 0 | 0 | 0 | 0
510831 | 16276 | consoletest_finish | tests/console/consoletest_finish.pm | console | 2015-03-10 11:25:26 | 2015-03-10 11:25:26 | none | 0 | 1 | 1 | 1


Files

l (113 KB) l coolo, 2015-03-10 16:12
Actions #1

Updated by coolo about 9 years ago

  • Status changed from New to Resolved
Actions #2

Updated by coolo about 9 years ago

  • Status changed from Resolved to New

No, that's not the only reason

id | job_id | name | script | category | t_created | t_updated | result | soft_failure | milestone | important | fatal
--------+--------+-----------------------+---------------------------------------------+--------------+---------------------+---------------------+---------+--------------+-----------+-----------+-------
512108 | 16300 | isosize | tests/installation/isosize.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 1 | 0
512109 | 16300 | bootloader_uefi | tests/installation/bootloader_uefi.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512110 | 16300 | welcome | tests/installation/welcome.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512111 | 16300 | scc_registration | tests/installation/scc_registration.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512112 | 16300 | addon_products_sle | tests/installation/addon_products_sle.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512113 | 16300 | partitioning | tests/installation/partitioning.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512114 | 16300 | partitioning_finish | tests/installation/partitioning_finish.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:26 | none | 0 | 0 | 0 | 1
512115 | 16300 | installer_timezone | tests/installation/installer_timezone.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:25:57 | passed | 0 | 0 | 0 | 1
512116 | 16300 | logpackages | tests/installation/logpackages.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:25:57 | passed | 0 | 0 | 0 | 1
512117 | 16300 | user_settings | tests/installation/user_settings.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:26:06 | passed | 0 | 0 | 0 | 1
512118 | 16300 | user_settings_root | tests/installation/user_settings_root.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:26:11 | passed | 0 | 0 | 0 | 1
512119 | 16300 | installation_overview | tests/installation/installation_overview.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:26:21 | passed | 0 | 0 | 0 | 1
512120 | 16300 | start_install | tests/installation/start_install.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:26:47 | passed | 0 | 0 | 0 | 1
512121 | 16300 | livecdreboot | tests/installation/livecdreboot.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:26:47 | running | 0 | 0 | 0 | 1
512122 | 16300 | first_boot | tests/installation/first_boot.pm | installation | 2015-03-10 13:22:26 | 2015-03-10 13:22:27 | none | 0 | 1 | 1 | 1
512123 | 16300 | consoletest_setup | tests/console/consoletest_setup.pm | console | 2015-03-10 13:22:27 | 2015-03-10 13:22:27 | none | 0 | 1 | 1 | 1


Mär 10 14:24:23 ix64ph1072 worker[32602]: 8515: WORKING 16300
Mär 10 14:25:43 ix64ph1072 worker[32602]: 2: Connection error: Premature connection close at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 194.
Mär 10 14:26:33 ix64ph1072 worker[32602]: 2: Connection error: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 194.
Mär 10 14:27:28 ix64ph1072 worker[32602]: 2: Connection error: Inactivity timeout at /usr/share/openqa/script/../lib/OpenQA/Worker/Common.pm line 194.

Actions #3

Updated by coolo about 9 years ago

  • Assignee deleted (coolo)
Actions #4

Updated by coolo about 9 years ago

the api log has interesting gaps

10.162.2.72 - - [10/Mar/2015:14:22:26 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:31 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:36 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:41 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:46 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:51 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:22:56 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:23:01 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 - "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:23:46 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:23:51 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 - "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:24:36 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:24:41 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:24:46 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 - "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:25:31 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"
10.162.2.72 - - [10/Mar/2015:14:25:36 +0100] "POST /api/v1/jobs/16300/status HTTP/1.1" 200 12 "-" "Mojolicious (Perl)"

Actions #5

Updated by coolo about 9 years ago

[Tue Mar 10 14:23:50 2015] [debug] 200 OK (1.200226s, 0.833/s).
[Tue Mar 10 14:23:50 2015] [debug] POST "/api/v1/jobs/16295/status".
[Tue Mar 10 14:23:50 2015] [debug] Routing to controller "OpenQA::Controller::API::V1" and action "auth".
[Tue Mar 10 14:23:50 2015] [debug] API key from client: AD52F5D0A6FDDB53
[Tue Mar 10 14:23:50 2015] [debug] API auth by user: https://www.opensuse.org/openid/user/lnussel, operator: 1
[Tue Mar 10 14:23:50 2015] [debug] Routing to controller "OpenQA::Controller::API::V1::Job" and action "update_status".
[Tue Mar 10 14:24:35 2015] [debug] DIR /var/lib/openqa/testresults/00016295-sle-11-SP4-Alpha-Server-DVD-i586-Build0941-RAID5

[Tue Mar 10 14:24:35 2015] [debug] DIR /var/lib/openqa/testresults/00016295-sle-11-SP4-Alpha-Server-DVD-i586-Build0941-RAID5

[Tue Mar 10 14:24:36 2015] [debug] 200 OK (45.633372s, 0.022/s).
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 9 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 6 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 7 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 11 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 8 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] Inactivity timeout.
[Tue Mar 10 14:24:36 2015] [debug] Worker 10 websocket connection closed - 1006
[Tue Mar 10 14:24:36 2015] [debug] POST "/api/v1/jobs/16300/status".

Note how one POST /status takes 45s to save the PNGs - that means there were 8 more queued in that time ;(

Actions #6

Updated by oholecek about 9 years ago

45s for upload is really bad. How is the WebUI handling this? Isn't it stuck during uploads? After #241 update_status calls are rejected when api call is in progress. I'm afraid that may skip some screen shots, I don't yet fully understand new upload_status. :(

Actions #7

Updated by coolo about 9 years ago

I can split the upload_status function into several /status calls, but I'm afraid this will open the whole for a 2nd timer even more. Possibly we should make update_status not reoccuring but one shot timers triggered by the previous call?

Actions #8

Updated by oholecek about 9 years ago

Hmm.. how will that solve this problem? It won't make changes any smaller since we will still need to wait until everything from previous call is uploaded. That is effectively what we have now (after #241), if api call is running event from recurrent timer is ignored.

What about we stop uploading screen shots for live view and proxy VNC directly? (e.g. https://github.com/InstantWebP2P/peer-vnc/tree/master/ ), or just upload them only when someone is actually looking?
And the only screen shots to be uploaded are those passed from needle match (e.g. assert_screen)?

Edit: I'm assuming here that the screen shots are the main bandwidth hog.

Actions #9

Updated by coolo about 9 years ago

uploading the live screenshot only when someone looks is ok - as long as we upload one as soon as we get the command. Possibly the same for the log

But the 45 seconds are spend in openQA not in the worker, so my approach was to split the one huge api call into several smaller. And let's get prefork done :)

Actions #10

Updated by coolo about 9 years ago

  • File l l added

I deployed #241 now and still get this:

id | job_id | name | script | category | t_created | t_updated | result | soft_failure | milestone | important | fatal
--------+--------+----------------------------------+--------------------------------------------------------+--------------+---------------------+---------------------+---------+--------------+-----------+-----------+-------
513240 | 16340 | bootloader | tests/installation/bootloader.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:00:22 | passed | 0 | 0 | 0 | 1
513241 | 16340 | welcome | tests/installation/welcome.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:00:22 | running | 0 | 0 | 0 | 1
513242 | 16340 | check_medium | tests/installation/check_medium.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:01:26 | passed | 0 | 0 | 0 | 0
513243 | 16340 | installation_mode | tests/installation/installation_mode.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:01:34 | passed | 0 | 0 | 0 | 1
513244 | 16340 | installer_timezone | tests/installation/installer_timezone.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:01:34 | running | 0 | 0 | 0 | 1
513245 | 16340 | server_base_scenario | tests/installation/server_base_scenario.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:01:54 | passed | 0 | 0 | 0 | 0
513246 | 16340 | logpackages | tests/installation/logpackages.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 16:01:54 | running | 0 | 0 | 0 | 1
513247 | 16340 | partitioning_raid_sle11 | tests/installation/partitioning_raid_sle11.pm | installation | 2015-03-10 15:59:13 | 2015-03-10 15:59:13 | none | 0 | 0 | 0 | 1

But the worker log shows nothing too obvious - note though that the time on worker and webui are off. I'll enable ntp now :)

Actions #11

Updated by oholecek about 9 years ago

I was looking into how results are collected in upload_status. I'm not yet completely in the picture, but doesn't it always write the complete status of indivitual tests from beggining to current_running?

1) read tests from $test_order (:257)
2) and then collects results for each test in read_result_file (:319, called :288).

If so, than there must be problem somewhere else and not in upload (storing on isotovideo or evaluation on scheduler side)

Actions #12

Updated by coolo about 9 years ago

no, it doesn't. This removes the test when uploading:
my $test = (shift @$test_order)->{name};

Actions #13

Updated by coolo about 9 years ago

  • Status changed from New to Resolved

I didn't spot it today at all

Actions #14

Updated by coolo about 9 years ago

  • Status changed from Resolved to New
  • Assignee set to oholecek
  • Priority changed from Normal to High

the problem seems to be that POST hits a busy mojo and as we use a single thread the POST is discarded. Apache sends a 200 for this error, so the worker will ignore it too ;(

The most straightforward (hopefully) solution is using the websocket.

It didn't happen yesterday most likely as we had not as much TODO

Actions #15

Updated by oholecek about 9 years ago

  • Status changed from New to In Progress

Workaround for POST requests is checking that json response is !null (https://github.com/os-autoinst/openQA/pull/266)
event updates using WebSockets (https://github.com/os-autoinst/openQA/pull/267) - experimental, set WORKER_USE_WEBSOCKETS to enable

Lets see if this will help.

Actions #16

Updated by coolo about 9 years ago

  • Status changed from In Progress to Resolved

no complaints today

Actions

Also available in: Atom PDF