Project

General

Profile

Actions

action #123649

closed

Error message in gru logs: Could not chdir back to start dir '' size:M

Added by tinita almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-01-25
Due date:
2023-02-09
% Done:

0%

Estimated time:

Description

Observation

I found these error messages in the gru journal on o3:

Jan 24 11:08:40 ariel openqa-gru[19289]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.                                                                                            
Jan 24 11:14:21 ariel openqa-gru[16615]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.                                                                                            
Jan 24 11:16:33 ariel openqa-gru[17881]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.                                                                                            
Jan 24 11:16:34 ariel openqa-gru[16571]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.                                                                                            
Jan 24 11:19:07 ariel openqa-gru[16544]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.                                                                                            
Jan 24 11:24:12 ariel openqa-gru[20153]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.
Jan 24 11:30:32 ariel openqa-gru[27467]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.
Jan 24 11:30:43 ariel openqa-gru[25253]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.

Acceptance criteria

  • AC1: gru logs don't contain these error messages

Suggestions

  • Try changing the working directory before invoking code from Archive::Extract
  • As desired artifically trigger the problem as mentioned in #123649#note-15
  • Check comments for reproducing the issue in production
  • Could it be multi-processes trying to access working with the same directories?
  • Could it be a system owned tmpfile cleanup?
  • Optional: Consider ensuring we have a proper log level reported, e.g. "[error] …"

Related issues 1 (1 open0 closed)

Related to openQA Project (public) - action #123175: o3 fails to download images resulting in zero sized disk images/isosNew2023-01-16

Actions
Actions #1

Updated by tinita almost 2 years ago

  • Description updated (diff)
Actions #2

Updated by mkittler almost 2 years ago

  • Assignee set to mkittler
Actions #3

Updated by mkittler almost 2 years ago

  • Related to action #123175: o3 fails to download images resulting in zero sized disk images/isos added
Actions #4

Updated by mkittler almost 2 years ago

Maybe this is what happened in #123175 - although the last 2 builds with relevant jobs for #123175 didn't reproduce the problem anymore.


This error message is actually expected to be stored as reason. However, I couldn't find any jobs via select id from jobs where reason like '%chdir%' order by id desc; on o3. So maybe in this particular case the error handling is broken. I'll try to reproduce.

Actions #5

Updated by mkittler almost 2 years ago

The case of "Could not determine archive type" is definitely handled correctly. When I manually throw an exception at the same location in Downloader.pm then the error is also handled correctly (resulting in a job with the exception message in the reason). So it is puzzling me why I cannot find such jobs on o3.

Actions #6

Updated by mkittler almost 2 years ago

AC1: gru logs don't contain unknown error messages

We could theoretically avoid it considering that the error message is actually stored as reason in the affected jobs. However, considering that I cannot find such jobs on o3 this may not be the best idea at this point.


So I see multiple problems:

  1. I cannot find the affected jobs. Maybe storing the error as reason doesn't work in this case although it did work when I provoked an error in the same location locally.
  2. The error itself. It could very well be just due to a broken archive. However, it is hard to tell without knowing the archive being extracted here (due to 1.).

Did this trigger some kind of alert/notification we need to take care of?

Actions #7

Updated by tinita almost 2 years ago

No, there was no alert, but ideally the gru logs should only contain "known" error messages.
Example:

grep was killed, possibly timed out: cmd=>grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' '/tmp/tmp.uvDJQTWjDE'< output=''

IMHO known errors should somehow be turned into [info] if we accept them as things that can happen.
It is not easy to look for unknown errors in the logs.
Or at least something like openqa-logwarn where we can ignore stuff.

But I tried to discuss this in the past already.

Actions #8

Updated by mkittler almost 2 years ago

It is very annoying that error messages of gru show up in the journal and less severe error messages in the logfile. However, the logfile at least reveals that the following archives were attempted to be extracted:

martchus@ariel:~> xzgrep -i -P 'Extracting' /var/log/openqa_gru*
/var/log/openqa_gru.13.xz:[2023-01-24T11:47:59.256956Z] [info] [#2198709] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk.x86_64-2.0.0-Disk-Build57.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk.x86_64-2.0.0-Disk-Build57.1.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:48:37.685868Z] [info] [#2198720] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk.x86_64-1.42.1-Build34.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk.x86_64-1.42.1-Build34.1.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:50:29.729464Z] [info] [#2198726] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk-v8.x86_64-1.3.0-Disk-Build53.2.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-v8.x86_64-1.3.0-Disk-Build53.2.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:54:19.329460Z] [info] [#2198734] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk-v9.x86_64-1.3.0-Disk-Build39.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-v9.x86_64-1.3.0-Disk-Build39.1.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:57:22.087206Z] [info] [#2198749] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk.x86_64-10.4-Disk-Build46.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk.x86_64-10.4-Disk-Build46.1.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:58:15.482793Z] [info] [#2198740] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk.x86_64-2.22.4-Disk-Build56.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk.x86_64-2.22.4-Disk-Build56.1.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T11:59:23.239320Z] [info] [#2198760] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.8.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.8.raw"
/var/log/openqa_gru.13.xz:[2023-01-24T12:00:23.194070Z] [info] [#2198765] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Virtual-Build56.8.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Virtual-Build56.8.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:05:40.981309Z] [info] [#2198560] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk-simple.x86_64-1.42.1-Build42.5.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk-simple.x86_64-1.42.1-Build42.5.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:05:45.037154Z] [info] [#2198555] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-custom-partitions.x86_64-1.15.1-Build45.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-custom-partitions.x86_64-1.15.1-Build45.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:05:50.385366Z] [info] [#2198558] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk.x86_64-1.42.1-Build43.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk.x86_64-1.42.1-Build43.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:06:10.606876Z] [info] [#2198564] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-lvm.x86_64-1.42.1-Build43.5.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-lvm.x86_64-1.42.1-Build43.5.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:07:40.729478Z] [info] [#2198569] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-overlayroot.x86_64-1.42.1-Build43.5.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-overlayroot.x86_64-1.42.1-Build43.5.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:07:59.379445Z] [info] [#2198567] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-orthos.x86_64-1.13.1-Build46.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-orthos.x86_64-1.13.1-Build46.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:11:20.036412Z] [info] [#2198571] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-suse-on-dnf.x86_64-1.42.1-Build42.5.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-suse-on-dnf.x86_64-1.42.1-Build42.5.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:15:35.959663Z] [info] [#2198573] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-raid.x86_64-1.42.1-Build38.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-raid.x86_64-1.42.1-Build38.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:17:15.524270Z] [info] [#2198576] Extracting "/var/lib/openqa/share/factory/tmp/openSUSE-Tumbleweed-1421-oem.x86_64.Build35.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/openSUSE-Tumbleweed-1421-oem.x86_64.Build35.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:18:24.529166Z] [info] [#2198579] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk-legacy.x86_64-1.42.1-Build50.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk-legacy.x86_64-1.42.1-Build50.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:21:35.557653Z] [info] [#2198581] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk-ramdisk.x86_64-1.42.1-Build45.5.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk-ramdisk.x86_64-1.42.1-Build45.5.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:34:08.087283Z] [info] [#2198612] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk.x86_64-1.15.3-Build42.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk.x86_64-1.15.3-Build42.1.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:35:21.746893Z] [info] [#2198609] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-custom-partitions.x86_64-1.15.3-Build42.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-custom-partitions.x86_64-1.15.3-Build42.1.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:37:38.947255Z] [info] [#2198614] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk-simple.x86_64-1.15.3-Build42.1.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk-simple.x86_64-1.15.3-Build42.1.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:37:54.305436Z] [info] [#2198618] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-lvm.x86_64-1.15.3-Build42.1.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-lvm.x86_64-1.15.3-Build42.1.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:39:15.705827Z] [info] [#2198622] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-disk-ramdisk.x86_64-1.15.3-Build42.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-disk-ramdisk.x86_64-1.15.3-Build42.1.raw"
/var/log/openqa_gru.14.xz:[2023-01-24T11:39:29.866499Z] [info] [#2198620] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-overlayroot.x86_64-1.15.3-Build42.1.vmdk.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-overlayroot.x86_64-1.15.3-Build42.1.vmdk"
/var/log/openqa_gru.14.xz:[2023-01-24T11:43:25.229599Z] [info] [#2198628] Extracting "/var/lib/openqa/share/factory/tmp/kiwi-test-image-live-disk.x86_64-2.0.0-Disk-Build46.1.raw.xz" to "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk.x86_64-2.0.0-Disk-Build46.1.raw"

However, none of these messages can be related because they were only logged slightly after the errors from the ticket description. Besides, the kiwi-related jobs all worked and that would only leave a single other log message but we have multiple occurrences.

Actions #9

Updated by mkittler almost 2 years ago

No, there was no alert, but ideally the gru logs should only contain "known" error messages.

That makes no sense. Do you mean "unknown" error messages instead. Otherwise, how would we ever learn about new problems then?

… if we accept them as things that can happen.

Considering the error should actually show up as reason in the info box of affected jobs that "if" should be the case. The only thing that makes me reluctant to drop the logging is that I cannot find the related jobs in this particular case.

Actions #10

Updated by mkittler almost 2 years ago

The concrete error comes from line 463 in /usr/lib/perl5/vendor_perl/5.36.0/Archive/Extract.pm (the concrete path/line might slightly differ on other systems). This code basically determines the working directory via Cwd::cwd, changes it to y via Perl's chdir built-in to the extraction target, does the extraction and then changes the working directory back via chdir again. The error message means that the last step is failing. That's quite strange, why would the original working directory be missing? Maybe the working directory was messed with before and this error is just a symptom of a previous problem.

Actions #11

Updated by mkittler almost 2 years ago

The problem has also happened on the 18.:

martchus@ariel:~> sudo journalctl -u openqa-gru | grep -i 'chdir'
Jan 18 07:42:39 ariel openqa-gru[5763]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.
…
Jan 18 08:25:15 ariel openqa-gru[4990]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.
Jan 24 11:08:40 ariel openqa-gru[19289]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.
…
Jan 24 12:09:44 ariel openqa-gru[2737]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.

So the dates when it happened correspond to the dates when the last two kiwi builds where scheduled (Tumbleweed-Build20230124 and Tumbleweed-Build20230118). The corresponding times are also matching. So it has very likely something to do with downloads triggered for these builds - even though the jobs have all passed/softfailed and the image downloading generally seemed ok as stated in #123175#note-17. (Note that when following logs yesterday I only checked the logfile as I was unaware that the journal also contains log messages of particular Minion jobs.)

Actions #12

Updated by tinita almost 2 years ago

mkittler wrote:

No, there was no alert, but ideally the gru logs should only contain "known" error messages.

That makes no sense. Do you mean "unknown" error messages instead. Otherwise, how would we ever learn about new problems then?

I meant known error message.
According to Oli it is acceptable to have "grep was killed ..." errors in the logs.
Only when error messages appear that we don't "know" then we have to act.
I don't know how to specify it more clearly, but there was a ticket about this in the past.

It doesn't mean I agree with this, that's just what I was told.

Actions #13

Updated by mkittler almost 2 years ago

I was wondering whether it would make sense to add a bit more context to these error messages. However, actually there should already be more context. Considering $log->error(qq{Extracting "$tempfile" failed: $_}); these error messages should start with Extracting …. However, it instead looks like the surrounding try … catch-block is not effective. That also fits the fact that the error message is not propagated as reason. However, then the Minion job should end up failed.

Actions #14

Updated by mkittler almost 2 years ago

  • Status changed from New to In Progress
Actions #15

Updated by mkittler almost 2 years ago

One can start a download job with extraction manually like this:

sudo -u geekotest /usr/share/openqa/script/openqa minion job -e download_asset -a '["http://download.opensuse.org/repositories/Virtualization:/Appliances:/Images:/Testing_x86:/archlinux/images/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw.xz", "/var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw", 1]'

So I've just did that. Maybe if I schedule many of such downloads in a loop I can reproduce the problem.

It took me a few attempts to get it right so the log message Cannot determine file type for … is from those attempts and can be ignored.

Actions #16

Updated by mkittler almost 2 years ago

Looks like there's no loop required, I've just provoked:

Jan 25 16:53:50 ariel openqa-gru[29298]: Could not chdir back to start dir '': No such file or directory' at /usr/share/openqa/script/../lib/OpenQA/Downloader.pm line 102.

When invoking the same command locally it works.

But on o3 it seems always reproducible (as long as one makes sure the target file doesn't already exists). The job still counts as finished (not failed) and it contains no¹ information about the problem. That explains why we haven't seen jobs with that reason. Additionally, despite the error the file actually exists in the filesystem and has been extracted correctly:

martchus@ariel:~> du -h /var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw
1.7G    /var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw
martchus@ariel:~> sha256sum /var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw
f32d0346d92e915abfaefbe11d8e08fa23bb2b2da2db9e3865b733d4a6f83358  /var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw

I think the explanation for all of this is that the log message is actually printed by Archive::Extract itself (in a way that makes it still point to our code). And Archive::Extract::extract actually returns a truthy return code despite the problem. So the error handling on our side doesn't kick in. However, considering that the archive actually could be extracted (only restoring the working dir didn't work) this is actually ok. The easiest way to get rid of the problem is likely to simply invoke the gru service with an explicit working directory that will always exist. Just setting WorkingDirectory=/var/lib/openqa in the service file doesn't help. That also mean that this log message is not an explanation for the zero-sized downloads from #123175.


¹

---
args:
- http://download.opensuse.org/repositories/Virtualization:/Appliances:/Images:/Testing_x86:/archlinux/images/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw.xz
- /var/lib/openqa/share/factory/hdd/kiwi-test-image-live-disk-kis.x86_64-1.0.0-Disk-Build56.10.raw
- 1
attempts: 1
children: []
created: 2023-01-25T17:06:44.223148Z
delayed: 2023-01-25T17:06:44.223148Z
expires: ~
finished: 2023-01-25T17:08:24.836001Z
id: 2203936
lax: 0
notes: {}
parents: []
priority: 0
queue: default
result: ~
retried: ~
retries: 0
started: 2023-01-25T17:06:44.243084Z
state: finished
task: download_asset
time: 2023-01-25T17:08:34.389121Z
worker: 1395
Actions #17

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-02-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by okurz almost 2 years ago

  • Subject changed from Error message in gru logs: Could not chdir back to start dir '' to Error message in gru logs: Could not chdir back to start dir '' size:M
  • Description updated (diff)
Actions #20

Updated by mkittler almost 2 years ago

  • Status changed from In Progress to Feedback

The PR is ready now. I had to change the approach for fixing. See comments on the PR for further details. TL;DR: It was apparmor.

Actions #21

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

The PR has been merged and deployed on o3:

martchus@ariel:~> sudo -u geekotest /usr/share/openqa/script/openqa eval -V 'use Cwd; "cwd: " . cwd'
"cwd: /home/martchus"

So it works now.

Actions

Also available in: Atom PDF