action #104136
closed
"archive_job_results" fail with "Unable to copy ... File exists" size:M
Description
Observation
From https://openqa.suse.de/minion/jobs?id=3630519
---
args:
- 7219499
attempts: 1
children: []
created: 2021-12-16T18:47:41.3699Z
delayed: 2021-12-17T00:09:01.26733Z
expires: 2021-12-18T18:47:41.3699Z
finished: 2021-12-17T00:09:02.89443Z
id: 3630519
lax: 0
notes:
  gru_id: 30744450
parents: []
priority: 0
queue: default
result: |
  Unable to copy '/var/lib/openqa/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga' to '/var/lib/openqa/archive/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga': File exists at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.
retried: 2021-12-17T00:08:01.26733Z
retries: 308
started: 2021-12-17T00:09:02.73895Z
state: failed
task: archive_job_results
time: 2021-12-17T07:20:16.09821Z
worker: 588
The reported source code line is https://github.com/os-autoinst/openQA/blob/cbc8674aac26e1e260b536aae72cbe0ed8ca9064/lib/OpenQA/Schema/Result/Jobs.pm#L270, with this context:
my $archived_result_dir = $self->add_result_dir_prefix($self->remove_result_dir_prefix($normal_result_dir), 1);
if (!File::Copy::Recursive::dircopy($normal_result_dir, $archived_result_dir)) {
    my $error = $!;
    File::Path::rmtree($archived_result_dir);    # avoid leftovers
--> die "Unable to copy '$normal_result_dir' to '$archived_result_dir': $error";
}
The directory /var/lib/openqa/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga/ still exists and has what looks like valid content. The directory /var/lib/openqa/archive/testresults/07219/ exists and currently has four other folders, but no folder for 07219499.
Steps to reproduce
See https://openqa.suse.de/minion/jobs?task=archive_job_results&state=failed for more similar errors.
Suggestions
- Investigate why the archive already exists - maybe the previous archive failed but wasn't removed
- Remove any leftovers / overwrite existing files
- Maybe the job was cancelled and re-run?
- Document why the job failed so we understand it better
- Maybe a good task to learn about the archiving code
Updated by okurz over 2 years ago
- Related to action #92788: Use openQA archiving feature on osd size:S added
Updated by livdywan over 2 years ago
- Subject changed from "archive_job_results" fail with "Unable to copy ... File exists at lib/OpenQA/Schema/Result/Jobs.pm line 270." to "archive_job_results" fail with "Unable to copy ... File exists" size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
I somehow doubt that the error message "File exists" is correct:
- The recursive copy function we use would actually write into an existing directory and also overwrite existing files.
- None of these directories actually exist at this point. (Ok, the cleanup to avoid leftovers would have removed these files.)
- There are also failures (with the same error message) which haven't been retried yet¹. So I would not know why the directory should have existed when the job was running.
¹
args:
- 7674401
attempts: 1
children: []
created: 2021-12-16T19:44:12.52169Z
delayed: 2021-12-16T19:44:12.52169Z
expires: 2021-12-18T19:44:12.52169Z
finished: 2021-12-17T00:57:05.53159Z
id: 3632392
lax: 0
notes:
  gru_id: 30746323
parents: []
priority: 0
queue: default
result: |
  Unable to copy '/var/lib/openqa/testresults/07674/07674401-sle-15-SP4-Online-s390x-Build63.1-xfstests_btrfs-generic@s390x-kvm-sle15' to '/var/lib/openqa/archive/testresults/07674/07674401-sle-15-SP4-Online-s390x-Build63.1-xfstests_btrfs-generic@s390x-kvm-sle15': File exists at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.
retried: ~
retries: 0
started: 2021-12-17T00:57:05.44191Z
state: failed
task: archive_job_results
time: 2021-12-17T12:17:12.49401Z
worker: 588
Updated by mkittler over 2 years ago
I retried jobs on OSD and couldn't pin down the issue. Even if the directory already exists, it works (as mentioned in the previous comment). If the target already exists but is a file, I get
Unable to copy '/var/lib/openqa/testresults/07630/07630513-sle-15-SP4-Online-ppc64le-Build61.1-create_hdd_sles4sap_gnome@ppc64le-2g' to '/var/lib/openqa/archive/testresults/07630/07630513-sle-15-SP4-Online-ppc64le-Build61.1-create_hdd_sles4sap_gnome@ppc64le-2g': Not a directory at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.
which is a different error (and it also leaves open the question of why there would be a file).
I can provoke the error message by creating a file (e.g. foo9) and then specifying a directory within it as the destination:
DB<18> $res = File::Copy::Recursive::dircopy("foo1", "foo9/foo10");
DB<19> print "error: $!";
error: File exists
But that would mean that e.g. /var/lib/openqa/archive/testresults/07674 was a file in our case. Or: the copying code noticed that the directory was missing, then a parallel job created it, and then the copying code tried to create the directory but ran into the error because it already existed by then. If that's correct, we could solve the problem by simply creating the common-prefix directory (e.g. 07674) ourselves before invoking dircopy.
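A minimal sketch of that idea, using only core modules. The helper name ensure_parent_dir and the example path are hypothetical and not taken from the openQA code; the only assumption is that File::Path::make_path does not report an already existing directory as an error.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical helper: make sure the common-prefix directory (e.g. ".../07674")
# exists before the recursive copy runs. make_path collects failures in the
# "error" list and does not report an already existing directory as one.
sub ensure_parent_dir {
    my ($archived_result_dir) = @_;
    my $parent = dirname($archived_result_dir);
    make_path($parent, {error => \my $errors});
    die "Unable to create '$parent'\n" if ($errors && @$errors) || !-d $parent;
    return $parent;
}

# Illustration in a temporary directory (the job directory name is made up)
my $base   = tempdir(CLEANUP => 1);
my $target = "$base/archive/testresults/07674/07674401-example";
my $first  = ensure_parent_dir($target);
my $second = ensure_parent_dir($target);    # second call: directory already exists
print "parent ready: $second\n";
```

The recursive copy would then be invoked only after the helper returns, so it never has to create the shared prefix directory itself.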
Updated by tinita over 2 years ago
Weird. Maybe we could add some debugging like:
if ($! == 17) {
    # EEXIST
    $errmsg .= qx{ls -l $dir_we_are_interested_in};
}
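A slightly more portable sketch of the same debugging idea: it uses the EEXIST constant from the core Errno module instead of the number 17 and inspects the path with file tests rather than shelling out to ls. The helper name copy_error_message and the message wording are hypothetical.

```perl
use strict;
use warnings;
use Errno qw(EEXIST);
use File::Temp qw(tempdir);

# Hypothetical helper: build the error message and, only for EEXIST,
# append what currently sits at the path we are interested in.
sub copy_error_message {
    my ($errno, $errstr, $dir) = @_;
    my $msg = "Unable to copy: $errstr";
    if ($errno == EEXIST) {
        my $kind = (-d $dir) ? 'directory' : (-e $dir) ? 'non-directory file' : 'nothing';
        $msg .= " ('$dir' is currently: $kind)";
    }
    return $msg;
}

# Provoke EEXIST by creating a directory that already exists
my $dir = tempdir(CLEANUP => 1);
my $ok  = mkdir $dir;
my ($errno, $errstr) = ($! + 0, "$!");    # capture $! right away, later calls may overwrite it
my $message = copy_error_message($errno, $errstr, $dir);
print "$message\n";
```

Capturing $! immediately matters because any later system call (including the file tests) may change it.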
Updated by mkittler over 2 years ago
Looks like the code in File::Copy::Recursive is indeed broken. The module implements its own function to make a path (pathmk) and it does it like this:
mkdir($pth,$DirPerms) or return if !-d $pth && !$nofatal;
So if the directory doesn't exist when it is checked but does exist by the time mkdir is called, we'd get exactly the error we see¹. Since we do run those jobs concurrently, that seems quite possible.
This PR should fix it: https://github.com/os-autoinst/openQA/pull/4416
¹
DB<29> print mkdir foo-new;
1
DB<30> print mkdir foo-new;
0
DB<31> print "error: $!";
error: File exists
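The interleaving can also be replayed deterministically in a single process. This is only a sketch: the "concurrent job" is simulated by an extra mkdir between the check and the create, and the directory name is made up. It contrasts the check-then-create pattern with File::Path::make_path from core Perl, which tolerates an existing directory.

```perl
use strict;
use warnings;
use Errno qw(EEXIST);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);
my $dir  = "$base/07674";

# Check-then-create, as in the quoted pathmk line: the check says "missing" ...
my $missing_at_check = !-d $dir;

# ... a concurrent job creates the directory in between (simulated in-process) ...
mkdir $dir or die "setup failed: $!";

# ... so the mkdir that follows the stale check fails with EEXIST.
my $racy_ok    = mkdir $dir;
my $racy_errno = $! + 0;

# Tolerant variant: make_path does not treat an existing directory as an error.
make_path($dir, {error => \my $errors});
my $tolerant_ok = !@$errors && -d $dir;

printf "racy mkdir ok: %d, EEXIST: %d, tolerant ok: %d\n",
  $racy_ok ? 1 : 0, $racy_errno == EEXIST ? 1 : 0, $tolerant_ok ? 1 : 0;
```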
Updated by kraih over 2 years ago
mkittler wrote:
Looks like the code in File::Copy::Recursive is indeed broken. The module implements its own function to make a path (pathmk) and it does it like this:
mkdir($pth,$DirPerms) or return if !-d $pth && !$nofatal;
So if the directory doesn't exist when checking it but exists when trying to create it we'd get exactly the error we see¹ and since we indeed run those jobs concurrently that seems very possible.
This PR should fix it: https://github.com/os-autoinst/openQA/pull/4416
Yes, it should. Mojo::File reuses the core Perl functionality, which can handle parallel path creation: https://metacpan.org/dist/File-Path/source/lib/File/Path.pm#L232
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
Then we should be good. The PR has been merged as well and I retried the remaining failed jobs (which succeeded now).
Updated by okurz over 2 years ago
- Status changed from Feedback to Resolved
https://openqa.suse.de/minion/jobs?task=archive_job_results looks sane. There don't seem to have been any archiving jobs in the past days. I guess that's expected because the first bunch was certainly the biggest and we have more space for now.