Project

General

Profile

action #104136

"archive_job_results" fail with "Unable to copy ... File exists" size:M

Added by okurz 7 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-12-17
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

From https://openqa.suse.de/minion/jobs?id=3630519

---
args:
- 7219499
attempts: 1
children: []
created: 2021-12-16T18:47:41.3699Z
delayed: 2021-12-17T00:09:01.26733Z
expires: 2021-12-18T18:47:41.3699Z
finished: 2021-12-17T00:09:02.89443Z
id: 3630519
lax: 0
notes:
  gru_id: 30744450
parents: []
priority: 0
queue: default
result: |
  Unable to copy '/var/lib/openqa/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga' to '/var/lib/openqa/archive/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga': File exists at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.
retried: 2021-12-17T00:08:01.26733Z
retries: 308
started: 2021-12-17T00:09:02.73895Z
state: failed
task: archive_job_results
time: 2021-12-17T07:20:16.09821Z
worker: 588

The source code line reported is https://github.com/os-autoinst/openQA/blob/cbc8674aac26e1e260b536aae72cbe0ed8ca9064/lib/OpenQA/Schema/Result/Jobs.pm#L270 , context:

    my $archived_result_dir = $self->add_result_dir_prefix($self->remove_result_dir_prefix($normal_result_dir), 1);
    if (!File::Copy::Recursive::dircopy($normal_result_dir, $archived_result_dir)) {
        my $error = $!;
        File::Path::rmtree($archived_result_dir);    # avoid leftovers
-->     die "Unable to copy '$normal_result_dir' to '$archived_result_dir': $error";
    }

The directory /var/lib/openqa/testresults/07219/07219499-sle-15-SP4-Online-x86_64-Build39.1-wayland-desktopapps-gnome@64bit-virtio-vga/ still exists and has what looks like valid content. The directory /var/lib/openqa/archive/testresults/07219/ exists and currently has four other folders, but no folder for 07219499

Steps to reproduce

see
https://openqa.suse.de/minion/jobs?task=archive_job_results&state=failed
for more similar errors

Suggestions

  • Investigate why the archive already exists - maybe the previous archive failed but wasn't removed
  • Remove any left-overs / override existing files
  • Maybe the job was cancelled and re-run?
  • Document why the job failed so we understand it better
  • Maybe a good task to learn about the archiving code

Related issues

Related to openQA Project - action #92788: Use openQA archiving feature on osd size:SResolved

History

#1 Updated by okurz 7 months ago

  • Related to action #92788: Use openQA archiving feature on osd size:S added

#2 Updated by cdywan 7 months ago

  • Subject changed from "archive_job_results" fail with "Unable to copy ... File exists at lib/OpenQA/Schema/Result/Jobs.pm line 270." to "archive_job_results" fail with "Unable to copy ... File exists" size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by mkittler 7 months ago

  • Assignee set to mkittler

#4 Updated by mkittler 7 months ago

  • Status changed from Workable to In Progress

I somehow doubt that the error message "File exists" is correct:

  • The recursive copy function we use would actually write in an existing directory and also override existing files.
  • None of these directories actually exist at this point. (Ok, the cleanup to avoid leftovers would have removed these files.)
  • There are also failures (with the same error message) which haven't been retried yet¹. So it would not know why the directory should have been existed when the job was running.

¹

args:
- 7674401
attempts: 1
children: []
created: 2021-12-16T19:44:12.52169Z
delayed: 2021-12-16T19:44:12.52169Z
expires: 2021-12-18T19:44:12.52169Z
finished: 2021-12-17T00:57:05.53159Z
id: 3632392
lax: 0
notes:
  gru_id: 30746323
parents: []
priority: 0
queue: default
result: |
  Unable to copy '/var/lib/openqa/testresults/07674/07674401-sle-15-SP4-Online-s390x-Build63.1-xfstests_btrfs-generic@s390x-kvm-sle15' to '/var/lib/openqa/archive/testresults/07674/07674401-sle-15-SP4-Online-s390x-Build63.1-xfstests_btrfs-generic@s390x-kvm-sle15': File exists at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.
retried: ~
retries: 0
started: 2021-12-17T00:57:05.44191Z
state: failed
task: archive_job_results
time: 2021-12-17T12:17:12.49401Z
worker: 588

#5 Updated by mkittler 7 months ago

I retired jobs on OSD and couldn't pin-down the issue. Even if the directory already exists it works (as mentioned in the previous comment). If the target already exists but is a file I get

  Unable to copy '/var/lib/openqa/testresults/07630/07630513-sle-15-SP4-Online-ppc64le-Build61.1-create_hdd_sles4sap_gnome@ppc64le-2g' to '/var/lib/openqa/archive/testresults/07630/07630513-sle-15-SP4-Online-ppc64le-Build61.1-create_hdd_sles4sap_gnome@ppc64le-2g': Not a directory at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/Jobs.pm line 270.

which is a different error (and besides leaves the question why there would be a file).


I can provoke the error message by creating a file (e.g. foo9) and then specifying a directory within as destination:

  DB<18> $res = File::Copy::Recursive::dircopy("foo1", "foo9/foo10");

  DB<19> print "error: $!";
error: File exists

But that would mean e.g. /var/lib/openqa/archive/testresults/07674 was a file in our case. Or: The the copying code noticed that the directory was missing, then a parallel job created it, then the copying code tried to create the directory but ran into the error because it now already exists. If that's correct we could solve the problem by simply creating the common-prefix directory (e.g. 07674) ourselves before invoking dircopy.

#6 Updated by tinita 7 months ago

Weird. Maybe we could add some debugging like:

if ($! == 17) {
    # EEXISTS
    $errmsg .= qx{ls -l $dir_we_are_interested_in}
}

#7 Updated by mkittler 7 months ago

Looks like the code in File::Copy::Recursive is indeed broken. The module implements its own function to make a path (pathmk) and it does it like this:

mkdir($pth,$DirPerms) or return if !-d $pth && !$nofatal;

So if the directory doesn't exists when checking it but exists when trying to create it we'd get exactly the error we see¹ and since we indeed run those jobs concurrently that seems very possible.

This PR should fix it: https://github.com/os-autoinst/openQA/pull/4416


¹

  DB<29> print mkdir foo-new;
1
  DB<30> print mkdir foo-new;
0
  DB<31> print "error: $!";
error: File exists

#8 Updated by kraih 7 months ago

mkittler wrote:

Looks like the code in File::Copy::Recursive is indeed broken. The module implements its own function to make a path (pathmk) and it does it like this:

mkdir($pth,$DirPerms) or return if !-d $pth && !$nofatal;

So if the directory doesn't exists when checking it but exists when trying to create it we'd get exactly the error we see¹ and since we indeed run those jobs concurrently that seems very possible.

This PR should fix it: https://github.com/os-autoinst/openQA/pull/4416

Yes, it should. Mojo::File reuses the core Perl functionality which can handle parallel path creation. https://metacpan.org/dist/File-Path/source/lib/File/Path.pm#L232

#9 Updated by mkittler 7 months ago

  • Status changed from In Progress to Feedback

Then we should be good. The PR has been merged as well and I retried the remaining failed jobs (which succeeded now).

#10 Updated by okurz 6 months ago

  • Status changed from Feedback to Resolved

https://openqa.suse.de/minion/jobs?task=archive_job_results looks sane. There don't seem to be any archiving jobs the past days. I guess that's expected because the first bunch was certainly the biggest and now we have more space for now.

Also available in: Atom PDF