action #57434

closed

openqaworker-arm-3 keeps running jobs for 18 hours

Added by coolo about 5 years ago. Updated about 5 years ago.

Status: Rejected
Priority: High
Assignee:
Category: Regressions/Crashes
Target version: -
Start date: 2019-09-27
Due date:
% Done: 0%
Estimated time:

Description

And it doesn't look like it will fix itself. The worker instances all show the same symptom:

ppoll([{fd=8, events=POLLIN|POLLPRI|POLLOUT}], 1, {tv_sec=300, tv_nsec=0}, NULL, 0) = 1 ([{fd=8, revents=POLLOUT}], left {tv_sec=299, tv_nsec=999997579})
write(8, "POST /status HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: 71\r\nHost: 127.0.0.1:7844\r\nAccept-Encoding: gzip\r\nUser-Agent: Mojolicious (Perl)\r\n\r\n{\"id\":14,\"lock\":\"SLES-15-SP1-aarch64-Installtest.qcow2.openqa.suse.de\"}", 225) = 225
ppoll([{fd=8, events=POLLIN|POLLPRI}], 1, {tv_sec=300, tv_nsec=0}, NULL, 0) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=299, tv_nsec=996297917})
read(8, "HTTP/1.1 404 Not Found\r\nServer: Mojolicious (Perl)\r\nContent-Length: 40\r\nContent-Type: application/json;charset=UTF-8\r\nDate: Fri, 27 Sep 2019 04:49:50 GMT\r\n\r\n{\"error\":\"Specified job ID is invalid.\"}", 131072) = 197
getpid()                                = 21412

It asks the cache service to download a file and gets an error, in an endless loop. The cache service in turn looks like this:

read(18, "POST /status HTTP/1.1\r\nContent-Length: 87\r\nContent-Type: application/json\r\nAccept-Encoding: gzip\r\nUser-Agent: Mojolicious (Perl)\r\nHost: 127.0.0.1:7844\r\n\r\n{\"id\":15,\"lock\":\"SLE-15-SP2-Installer-DVD-aarch64-Build45.1-Media1.iso.openqa.suse.de\"}", 131072) = 241
getsockname(18, {sa_family=AF_INET, sin_port=htons(7844), sin_addr=inet_addr("127.0.0.1")}, [256->16]) = 0
getsockname(18, {sa_family=AF_INET, sin_port=htons(7844), sin_addr=inet_addr("127.0.0.1")}, [256->16]) = 0
getpid()                                = 96160
newfstatat(AT_FDCWD, "/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=45056, ...}, 0) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
getpid()                                = 96160
newfstatat(AT_FDCWD, "/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=45056, ...}, 0) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
getpid()                                = 96160
newfstatat(AT_FDCWD, "/var/lib/openqa/cache/cache.sqlite", {st_mode=S_IFREG|0644, st_size=45056, ...}, 0) = 0
fcntl(10, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(10, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
write(19, "HTTP/1.1 404 Not Found\r\nDate: Fri, 27 Sep 2019 04:54:12 GMT\r\nContent-Type: application/json;charset=UTF-8\r\nServer: Mojolicious (Perl)\r\nContent-Length: 40\r\n\r\n{\"error\":\"Specified job ID is invalid.\"}", 197) = 197
ppoll([{fd=21, events=POLLIN|POLLPRI}, {fd=23, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=18, events=POLLIN|POLLPRI|POLLOUT}, {fd=5, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN|POLLPRI}, {fd=22, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=19, events=POLLIN|POLLPRI|POLLOUT}, {fd=20, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN|POLLPRI}], 16, {tv_sec=0, tv_nsec=488000000}, NULL, 0) = 2 ([{fd=18, revents=POLLOUT}, {fd=19, revents=POLLOUT}], left {tv_sec=0, tv_nsec=487989739})
write(18, "HTTP/1.1 404 Not Found\r\nContent-Type: application/json;charset=UTF-8\r\nDate: Fri, 27 Sep 2019 04:54:12 GMT\r\nServer: Mojolicious (Perl)\r\nContent-Length: 40\r\n\r\n{\"error\":\"Specified job ID is invalid.\"}", 197) = 197
ppoll([{fd=21, events=POLLIN|POLLPRI}, {fd=23, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=18, events=POLLIN|POLLPRI|POLLOUT}, {fd=5, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN|POLLPRI}, {fd=22, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=19, events=POLLIN|POLLPRI}, {fd=20, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN|POLLPRI}], 16, {tv_sec=0, tv_nsec=487000000}, NULL, 0) = 1 ([{fd=18, revents=POLLOUT}], left {tv_sec=0, tv_nsec=486991100})
ppoll([{fd=21, events=POLLIN|POLLPRI}, {fd=23, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}, {fd=18, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}, {fd=15, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN|POLLPRI}, {fd=22, events=POLLIN|POLLPRI}, {fd=13, events=POLLIN|POLLPRI}, {fd=11, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=19, events=POLLIN|POLLPRI}, {fd=20, events=POLLIN|POLLPRI}, {fd=17, events=POLLIN|POLLPRI}, {fd=14, events=POLLIN|POLLPRI}, {fd=16, events=POLLIN|POLLPRI}], 16, {tv_sec=0, tv_nsec=487000000}, NULL, 0) = 1 ([{fd=19, revents=POLLIN}], left {tv_sec=0, tv_nsec=486560769})
read(19, "POST /status HTTP/1.1\r\nAccept-Encoding: gzip\r\nHost: 127.0.0.1:7844\r\nUser-Agent: Mojolicious (Perl)\r\nContent-Length: 77\r\nContent-Type: application/json\r\n\r\n{\"id\":20,\"lock\":\"sle-15-SP2-aarch64-Build45.1-with-hpc.qcow2.openqa.suse.de\"}", 131072) = 231
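
Both traces show the same exchange: a POST to /status on 127.0.0.1:7844 with a JSON body of "id" and "lock", answered by a 404 with "Specified job ID is invalid.", repeated without end. As a minimal sketch only (this is not the actual openQA worker code), a client loop could treat that 404 as final and cap the number of attempts. The names post_status, wait_for_asset and InvalidJobError, and the "processed" status value, are hypothetical and chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical transport: takes the JSON payload, returns (HTTP status, decoded body).
StatusFn = Callable[[Dict[str, object]], Tuple[int, Dict[str, object]]]


class InvalidJobError(RuntimeError):
    """The cache service does not know the job ID, so retrying cannot help."""


def wait_for_asset(post_status: StatusFn, job_id: int, lock: str,
                   max_attempts: int = 10, delay: float = 1.0) -> Dict[str, object]:
    """Poll the status endpoint, giving up on 404 or after max_attempts."""
    payload = {"id": job_id, "lock": lock}
    for _ in range(max_attempts):
        code, body = post_status(payload)
        if code == 404:
            # Same answer as in the traces: stop instead of asking again.
            raise InvalidJobError(f"job {job_id}: {body.get('error', 'not found')}")
        # Assumption: a finished download is reported as status "processed".
        if code == 200 and body.get("status") == "processed":
            return body
        time.sleep(delay)
    raise TimeoutError(f"job {job_id}: no result after {max_attempts} attempts")
```

With a bound like this the worker would fail the job after the first 404 rather than holding it for hours; whether failing or re-enqueuing the download is the right reaction is a separate decision.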

As arm-3 was offline, I assume this is due to a corrupt DB?
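
If a corrupt DB is the suspicion, SQLite's own PRAGMA integrity_check could be run against the cache.sqlite path seen in the trace. A rough sketch, assuming Python with the standard sqlite3 module is available on the worker; check_sqlite is a hypothetical helper, not part of openQA.

```python
import sqlite3
from pathlib import Path
from typing import List


def check_sqlite(path: str) -> List[str]:
    """Return the rows of PRAGMA integrity_check, or a one-item error list."""
    # Open read-only through a file: URI so the check never creates the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"cannot open: {exc}"]
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [str(row[0]) for row in rows]
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not a SQLite database at all.
        return [f"error: {exc}"]
    finally:
        conn.close()


if __name__ == "__main__":
    # Path taken from the newfstatat() calls in the trace.
    print(check_sqlite("/var/lib/openqa/cache/cache.sqlite"))
```

A healthy file yields the single row "ok"; anything else lists the problems SQLite found. A clean result would not rule out stale or missing job rows, which would fit the 404 answers equally well.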


Related issues 1 (0 open, 1 closed)

Is duplicate of openQA Infrastructure (public) - action #54128: [tools] openqaworker-arm-3 is broken (Resolved, okurz, 2019-07-11)

Actions #1

Updated by coolo about 5 years ago

I'll restart the services

Actions #2

Updated by okurz about 5 years ago

  • Is duplicate of action #54128: [tools] openqaworker-arm-3 is broken added
Actions #3

Updated by okurz about 5 years ago

  • Status changed from New to Rejected
  • Assignee set to okurz

We know that already. You wanted me to update tickets less, so I tend to update my comments more often. See #54128#note-8

Actions #4

Updated by coolo about 5 years ago

  • Target version deleted (Ready)