coordination #89278

[epic] Concurrent unnecessary uploads from investigation jobs corrupt qcow images

Added by okurz about 2 months ago. Updated about 2 months ago.

Status:
Workable
Priority:
Normal
Assignee:
-
Category:
Feature requests
Target version:
Start date:
2021-03-01
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Difficulty:

Description

Observation

As brought up by GraceWang: "did anybody face the problem that the qcow2 in osd got damaged? What I mean damaged is that the qcow2 type is changed and then can not boot to desktop https://openqa.nue.suse.com/tests/5564738#step/boot_to_desktop/1"
We don't calculate a checksum for the generated qcow images, so we don't know any initial hash. The image booted fine initially, so it seems it was damaged while it was just sitting on osd, which of course is weird.

okurz@openqa:~> ls -lh /var/lib/openqa/share/factory/hdd/SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2
-rw-r--r-- 1 geekotest nogroup 2.5G Feb 27 13:34 /var/lib/openqa/share/factory/hdd/SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2
okurz@openqa:~> file /var/lib/openqa/share/factory/hdd/SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2
/var/lib/openqa/share/factory/hdd/SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2: data
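When `file` reports plain "data" for a .qcow2 file, it did not recognize the qcow header. As a minimal sketch of the same check, assuming only that a qcow/qcow2 image starts with the magic bytes `QFI\xfb` followed by a big-endian 32-bit version, the header could be inspected like this (the function name `looks_like_qcow` is hypothetical and not part of openQA):

```python
import struct

# First four bytes of a qcow/qcow2 image header.
QCOW_MAGIC = b"QFI\xfb"


def looks_like_qcow(path):
    """Return the qcow format version if the header magic matches, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != QCOW_MAGIC:
        return None
    # The version is a big-endian 32-bit integer directly after the magic.
    (version,) = struct.unpack(">I", header[4:8])
    return version
```

This only validates the first eight bytes, so it can flag an image whose header was overwritten but it cannot prove that the rest of the image is intact.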

https://openqa.nue.suse.com/tests/5555304#dependencies is the original job that created the qcow image; it finished at 2021-02-26 19:47:35 +0000. From its log:

[2021-02-26T20:46:35.0065 CET] [info] [pid:17912] SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2: Processing chunk 2602/2602, avg. speed ~440.938 KiB/s

So, as the timestamps show, the qcow image was modified (mtime Feb 27 13:34) many hours after the initial job uploaded it on 2021-02-26. https://openqa.nue.suse.com/tests/5555310# is the first job within the scenario sle-15-SP3-Online-x86_64-wayland-desktopapps-documentation@64bit-virtio-vga for build 156.3, and it was able to boot the qcow image correctly. Later tests in this scenario then failed to find a valid qcow image.

Problem

Looking into the database I can find:

openqa=# select jobs.id,t_finished,state,result,result_dir from jobs,job_settings where jobs.id = job_settings.job_id and key = 'PUBLISH_HDD_1' and value = 'SLE-15-SP3-x86_64-Build156.3-sled-gnome.qcow2' order by t_finished;
   id    |     t_finished      | state | result | result_dir
---------+---------------------+-------+--------+---------------------------------------------------------------------------------------------
 5554094 | 2021-02-26 18:25:43 | done  | failed | 05554094-sle-15-SP3-Online-x86_64-Build156.3-create_hdd_sled_gnome@64bit-virtio-vga
 5555304 | 2021-02-26 19:47:35 | done  | passed | 05555304-sle-15-SP3-Online-x86_64-Build156.3-create_hdd_sled_gnome@64bit-virtio-vga
 5555252 | 2021-02-27 12:27:22 | done  | passed | 05555252-sle-15-SP3-Online-x86_64-create_hdd_sled_gnome:investigate:retry@64bit-virtio-vga
 5555253 | 2021-02-27 12:35:54 | done  | passed | 05555253-sle-15-SP3-Online-x86_64-create_hdd_sled_gnome:investigate:last_good_tests:a24328b7173f02f5c4cf54f74f6799abcc9ff839@64bit-virtio-vga

So I suspect that the two "investigate:*" jobs overwrote the qcow image at the same time. The logs confirm that: both jobs uploaded to the same file during the same time period.

Acceptance criteria

  • AC1: Jobs are prevented from uploading to the same asset file at the same time
  • AC2: Assets with incorrect checksum are not used in other jobs
  • AC3: Investigation jobs don't override production assets
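
One possible way to approach AC1 is an exclusive lock per target file, combined with writing to a temporary file that is renamed into place only after the upload has finished. The following is a minimal sketch under these assumptions, not how openQA handles uploads; the names `exclusive_upload` and `UploadInProgress` and the `.lock` and `.partial` suffixes are hypothetical:

```python
import contextlib
import os


class UploadInProgress(Exception):
    """Raised when another upload already holds the lock for the target."""


@contextlib.contextmanager
def exclusive_upload(target_path):
    """Yield a temporary path to write to; publish it atomically on success."""
    lock_path = target_path + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists, so only
        # one uploader at a time can get past this point.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise UploadInProgress(target_path) from None
    tmp_path = target_path + ".partial"
    try:
        os.close(fd)
        yield tmp_path
        # Rename only after the caller has finished writing, so readers
        # never see a half-written asset.
        os.replace(tmp_path, target_path)
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        os.unlink(lock_path)
```

A second caller that enters `exclusive_upload` for the same target while the first is still writing gets `UploadInProgress` and could then be aborted as incomplete. A real implementation would also need to deal with stale lock files left behind by crashed uploads, which this sketch does not handle.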

Suggestions

  • Calculate checksum of big assets on worker side before uploading
  • Upload checksum along with big assets
  • Check checksum on webUI side after receiving upload
  • Prevent multiple concurrent uploads to the same file, e.g. abort any additional job as incomplete when trying to upload to the same file
  • Prevent investigation jobs from doing any asset uploads
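
The checksum suggestions above could be sketched as follows. This is an illustration only, assuming SHA-256 as the hash and that the worker sends the hex digest alongside the asset; the function names `sha256_of_file` and `verify_upload` are hypothetical and do not exist in openQA:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GiB images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_upload(path, expected_hex):
    """Return True if the received file matches the checksum sent with it."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

On the worker side, `sha256_of_file` would run before the upload starts. On the webUI side, `verify_upload` would run once the last chunk has been received, and an asset failing the check would not be offered to other jobs (AC2).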

Workaround

Retrigger the parent image creation job


Subtasks

action #89281: Prevent investigation jobs to do any asset uploads to prevent overriding production assets (Resolved, mkittler)

History

#1 Updated by okurz about 2 months ago

  • Subject changed from qcow image becomes corrupted while it is sitting in storage to [epic] Concurrent unnecessary uploads from investigation jobs corrupt qcow images
  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to High
  • Target version changed from future to Ready

#2 Updated by okurz about 2 months ago

  • Tracker changed from action to coordination

#3 Updated by okurz about 2 months ago

  • Category changed from Concrete Bugs to Feature requests
  • Status changed from Workable to Blocked
  • Target version changed from Ready to future

one subtask to track

#4 Updated by okurz about 2 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

Subtask done; the rest can be handled at any later time.
