Project

General

Profile

action #54869

improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.)

Added by JRivrain over 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
-
Start date:
2019-07-30
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Test fails as incomplete and there is no clue in the logs of what's happening. This is pretty annoying since other tests depend on this one.
The logs look identical in working and non-working runs.

I could observe that the "upload" phase lasts very long, like at least 30 minutes.

Test suite description

image creation job used as parent for other jobs testing based on existing installation. To be used as START_AFTER_TEST=create_hdd_gnome

Reproducible

Fails since (at least) Build 0251 (current job)

Expected result

Last good: 0248 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Project - action #54902: openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error"Resolved2019-07-31

Related to openQA Project - coordination #62420: [epic] Distinguish all types of incompletesResolved2018-12-12

Copied to openQA Infrastructure - action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08"Resolved2019-07-30

History

#1 Updated by JRivrain over 2 years ago

  • Priority changed from Normal to High

#2 Updated by JERiveraMoya over 2 years ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Bugs in existing tests)

#3 Updated by JRivrain over 2 years ago

Probably linked to https://progress.opensuse.org/issues/54902, but in the example I gave the error message "502 response: Proxy Error" does not appear.

#4 Updated by okurz over 2 years ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Current Sprint

looking into this. With ssh openqaworker-arm-2 "sudo tail -f /var/lib/openqa/pool/3/{autoinst,worker}-log.txt" in a screen session on lord.arch I am monitoring the current job https://openqa.suse.de/tests/3203400:

[2019-08-02T10:20:20.342 UTC] [debug] done with backend process                                                                                               
42905: EXIT 0                                                                                                                                                 
[2019-08-02T10:20:20.0615 UTC] [info] [pid:50163] Isotovideo exit status: 0                                                                                   
[2019-08-02T10:20:21.0314 UTC] [info] [pid:50163] +++ worker notes +++                                                                                        
[2019-08-02T10:20:21.0314 UTC] [info] [pid:50163] End time: 2019-08-02 10:20:21                                                                               
[2019-08-02T10:20:21.0315 UTC] [info] [pid:50163] Result: done                                                                                                
[2019-08-02T10:20:21.0315 UTC] [info] [pid:50163] Uploading logs_from_installation_system-y2logs.tar.bz2                                                      
[2019-08-02T10:20:22.0693 UTC] [info] [pid:50163] Uploading SLES-12-SP5-aarch64-Build0256@aarch64-gnome-uefi-vars.qcow2                                       
[2019-08-02T10:20:26.0842 UTC] [info] [pid:50163] Uploading SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2                                                 

==> /var/lib/openqa/pool/3/worker-log.txt <==                                                                                                                 
[2019-08-02T12:05:48.0070 UTC] [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1020/1850 avg speed ~122.070KB/s        
[2019-08-02T12:05:57.0820 UTC] [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1021/1850 avg speed ~108.507KB/s

so the job is already "done" since 2 hours and is currently uploading but very slow. Currently it still looks like an actual infrastructure issue.

Calling dmesg what I see repeatedly is:

[Fri Aug  2 11:59:28 2019] /usr/bin/isotov[55781]: unhandled level 2 translation fault (11) at 0x0000055c, esr 0x92000006
[Fri Aug  2 11:59:28 2019] pgd = ffff80010642d000
[Fri Aug  2 11:59:28 2019] [0000055c] *pgd=00000001791f7003, *pud=00000009879b9003, *pmd=0000000000000000

[Fri Aug  2 11:59:28 2019] CPU: 28 PID: 55781 Comm: /usr/bin/isotov Tainted: G        W          4.4.114-94.11-default #1
[Fri Aug  2 11:59:28 2019] Hardware name: GIGABYTE R270-T64-00/MT60-SC4-00, BIOS T32 03/03/2017
[Fri Aug  2 11:59:28 2019] task: ffff8009b6d99080 ti: ffff800104e10000 task.ti: ffff800104e10000
[Fri Aug  2 11:59:28 2019] PC is at 0x488180
[Fri Aug  2 11:59:28 2019] LR is at 0x488164
[Fri Aug  2 11:59:28 2019] pc : [<0000000000488180>] lr : [<0000000000488164>] pstate: 20000000
[Fri Aug  2 11:59:28 2019] sp : 0000ffff90bbb650
[Fri Aug  2 11:59:28 2019] x29: 0000ffff90bbb650 x28: 0000ffff90bbc950 
[Fri Aug  2 11:59:28 2019] x27: 0000ffff90bbc94c x26: 000000000000183b 
[Fri Aug  2 11:59:28 2019] x25: 000000000df96b97 x24: 0000000000000000 
[Fri Aug  2 11:59:28 2019] x23: 000000000df96b98 x22: 0000000000003077 
[Fri Aug  2 11:59:28 2019] x21: 0000000000000000 x20: 0000ffff90bbc970 
[Fri Aug  2 11:59:28 2019] x19: 000000000000000f x18: 0000ffff92d6fe00 
[Fri Aug  2 11:59:28 2019] x17: 0000ffff92d7cd74 x16: 0000000000580350 
[Fri Aug  2 11:59:28 2019] x15: 0000ffff92eee000 x14: 0000ffff92d6fc74 
[Fri Aug  2 11:59:28 2019] x13: 0000ffff90bbc800 x12: 0000000019f63bf0 
[Fri Aug  2 11:59:28 2019] x11: 0000000000000300 x10: 0000000000000003 
[Fri Aug  2 11:59:28 2019] x9 : 0000000000000400 x8 : 0000000000000062 
[Fri Aug  2 11:59:28 2019] x7 : 0000000000000000 x6 : 00000000ffffffbb 
[Fri Aug  2 11:59:28 2019] x5 : 0000ffff90bbd8f0 x4 : 0000ffff90bbc8cc 
[Fri Aug  2 11:59:28 2019] x3 : 0000000000000000 x2 : 0000000000000000 
[Fri Aug  2 11:59:28 2019] x1 : 0000000000000000 x0 : 0000000000000000 

first occurence is at least on [Tue Jul 9 09:50:21 2019].

# xzgrep Tainted messages*
messages:2019-08-02T00:14:35.347943+00:00 openqaworker-arm-2 kernel: [2649626.018815] CPU: 13 PID: 72009 Comm: /usr/bin/isotov Tainted: G        W          4.4.114-94.11-default #1
[…]
messages-20190425.xz:2019-04-24T10:47:01.119250+00:00 openqaworker-arm-2 kernel: [1116505.781782] CPU: 63 PID: 72879 Comm: /usr/bin/isotov Tainted: G        W          4.4.114-94.11-default #1

so this goes back for very long and is therefore less likely to be the cause of the same issue. TODO report upstream

#5 Updated by okurz over 2 years ago

  • Project changed from openQA Infrastructure to openQA Project
  • Assignee changed from okurz to mkittler

The oldest job with comparable symptoms seems to be https://openqa.suse.de/tests/3093779 on openqaworker-arm-2:1 which is "incomplete" even though both the autoinst-log.txt as well as worker-log.txt seem to be complete and also do not point out any related error. However the logs look like the "restructured worker" was already in place hence reassigning the ticket to "openQA Project" and assigning to mkittler.

#6 Updated by okurz over 2 years ago

  • Related to action #54902: openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error" added

#7 Updated by okurz over 2 years ago

As expected in the end the job went into timeout but was still uploading, from the worker log while still attached with tail:

[2019-08-02T14:18:09.0253 UTC] [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1566/1850 avg speed ~88.778KB/s
[2019-08-02T14:18:12.0539 UTC] [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1567/1850 avg speed ~325.521KB/s
[2019-08-02T14:18:17.0542 UTC] [info] [pid:50163] Quit due to signal TERM
[2019-08-02T14:18:19.0073 UTC] [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1568/1850 avg speed ~139.509KB/s

From the journal on the log I can see:

Aug 02 14:18:12 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1567/1850 avg speed ~325.521KB/s
Aug 02 14:18:17 openqaworker-arm-2 systemd[1]: Stopping openQA Worker #3...
Aug 02 14:18:17 openqaworker-arm-2 worker[50163]: [info] [pid:50163] Quit due to signal TERM
Aug 02 14:18:19 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1568/1850 avg speed ~139.509KB/s
Aug 02 14:18:32 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1569/1850 avg speed ~75.120KB/s
Aug 02 14:18:38 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1570/1850 avg speed ~162.760KB/s
Aug 02 14:18:49 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1571/1850 avg speed ~88.778KB/s
Aug 02 14:18:52 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1572/1850 avg speed ~325.521KB/s
Aug 02 14:18:59 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1573/1850 avg speed ~139.509KB/s
Aug 02 14:19:07 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1574/1850 avg speed ~122.070KB/s
Aug 02 14:19:26 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1575/1850 avg speed ~51.398KB/s
Aug 02 14:19:28 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1576/1850 avg speed ~488.281KB/s
Aug 02 14:19:37 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1577/1850 avg speed ~108.507KB/s
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: openqa-worker@3.service: State 'stop-sigterm' timed out. Killing.
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: openqa-worker@3.service: Main process exited, code=killed, status=9/KILL
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: Stopped openQA Worker #3.
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: openqa-worker@3.service: Unit entered failed state.
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: openqa-worker@3.service: Failed with result 'signal'.
Aug 02 14:19:47 openqaworker-arm-2 systemd[1]: Starting openQA Worker #3...

so it seems that something was stopping the actual systemd service at "Aug 02 14:18:17". The worker did not react to SIGTERM so at 14:19 systemd send KILL to the process. I did not see that e.g. instance 1 or 2 were stopped. At the time of 14:18 they were just idling and waiting for new jobs.

From the complete system journal I could find:

Aug 02 14:18:17 openqaworker-arm-2 systemd[1]: Starting Collect system and openqa metrics and send them to graphite server...
Aug 02 14:18:17 openqaworker-arm-2 systemd[1]: Stopping openQA Worker #3...

is this just a coincidence or did the first trigger the second? as in openqa-metrics.service terminated the worker?

#8 Updated by okurz over 2 years ago

  • Copied to action #55061: openqa-metrics.service failed on openqaworker-arm-2 since "Jul 24 17:07:08" added

#9 Updated by szarate over 2 years ago

From the complete system journal I could find:

Aug 02 14:18:17 openqaworker-arm-2 systemd[1]: Starting Collect system and openqa metrics and send them to graphite server...
Aug 02 14:18:17 openqaworker-arm-2 systemd[1]: Stopping openQA Worker #3...

is this just a coincidence or did the first trigger the second? as in openqa-metrics.service terminated the worker?

on openqaworker-arm-2, instance 3, it was me. But if you look at the autoinst-log from other instances, you will see that there are proxy errors

Aug 02 14:18:19 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1568/1850 avg speed ~139.509KB/s
Aug 02 14:18:32 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1569/1850 avg speed ~75.120KB/s
Aug 02 14:18:38 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1570/1850 avg speed ~162.760KB/s
Aug 02 14:18:49 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1571/1850 avg speed ~88.778KB/s
Aug 02 14:18:52 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1572/1850 avg speed ~325.521KB/s
Aug 02 14:18:59 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1573/1850 avg speed ~139.509KB/s
Aug 02 14:19:07 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1574/1850 avg speed ~122.070KB/s
Aug 02 14:19:26 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1575/1850 avg speed ~51.398KB/s
Aug 02 14:19:28 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1576/1850 avg speed ~488.281KB/s
Aug 02 14:19:37 openqaworker-arm-2 worker[50163]: [info] [pid:50163] SLES-12-SP5-aarch64-Build0256@aarch64-gnome.qcow2: Processing chunk 1577/1850 avg speed ~108.507KB/s

Internet in my home country can be faster. There are other jobs that you can investigate too, with similar symptoms

#10 Updated by mkittler over 2 years ago

I haven't looked into this in details but the log from szarate looks like the uploads took too long so the web socket server has marked the job as incomplete. The unresponsiveness of the worker is something we have to tackle (there are also other issues about that).

Note that the latest build even has the proxy error in the logs:

[2019-07-30T12:41:45.0125 UTC] [error] [pid:34097] Failed uploading chunk
[2019-07-30T12:41:45.0125 UTC] [error] [pid:34097] Error uploading SLES-12-SP5-aarch64-Build0251@aarch64-gnome.qcow2: 502 response: Proxy Error
[2019-07-30T12:41:45.0126 UTC] [error] [pid:34097] Upload failed for chunk 992

#11 Updated by okurz over 2 years ago

  • Subject changed from Test fails as incomplete most of the time, no clue what happens from the logs. to improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.)
  • Category set to Feature requests
  • Status changed from Feedback to New
  • Assignee deleted (mkittler)
  • Priority changed from High to Normal
  • Target version deleted (Current Sprint)

ok, fine. So that seems to be what #54902 is talking about. Then I suggest we focus this ticket on "improve feedback in case job is incompleted due to too long uploading" and focus on #54902 for fixing the root cause.

#12 Updated by okurz almost 2 years ago

#13 Updated by okurz over 1 year ago

  • Status changed from New to Resolved
  • Assignee set to okurz

Meanwhile we have better reporting in case of incomplete jobs, e.g. with the "reason" that is accessible over DB, API and UI. Also we have improved the uploading by doing it asynchronously, speed it up, prevent double uploads as well as fix problems that lead to sporadically very slow upload rates.

Also available in: Atom PDF