action #17788

[tools]Uploading images chksum check relies on global /var/lib/openqa/share

Added by coolo over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2017-03-19
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

The webui just defines the path and the worker calls cksum on it - so this is broken for multiple webuis with or without NFS.

We did that to avoid the webui becoming a bottleneck calculating checksums, but possibly we have to bite the bullet and have it calculate checksums of uploaded chunks, so it's not blocking for long.


Related issues

Related to openQA Project - action #30352: Disk upload failures (Resolved, 2018-01-15)

Has duplicate openQA Project - action #32284: Test incompletes because the publishing of the HDD fails (Resolved, 2018-02-26)

Blocks openQA Project - action #27955: Allow the worker_bridge to sync job status from a slaveUI to a masterUI (Rejected, 2017-11-20)

History

#1 Updated by RBrownSUSE over 4 years ago

  • Subject changed from Uploading images chksum check relies on global /var/lib/openqa/share to [tools]Uploading images chksum check relies on global /var/lib/openqa/share

#2 Updated by okurz over 4 years ago

  • Category set to Concrete Bugs

#3 Updated by coolo almost 4 years ago

  • Target version set to Ready

The prio of this is a bit higher than normal, but not High :)

#4 Updated by szarate almost 4 years ago

I'm +1 to getting this one fixed. It's quite a pain, especially when someone is trying to come up with a docker image and forgets about the asset uploading part :)

#5 Updated by AdamWill almost 4 years ago

For the record, I recently talked to coolo about this because I'm looking into the viability of hosting workers "further away" in network terms from the server (i.e. In The Cloud). This setup is a bit of a problem for that, in a few ways.

It'd be much easier to do this in general of course if you didn't have to worry about setting up the shared filesystem on worker hosts at all. Needing it at all means you have to figure out a way to make it fly when you can't just stick up an unauthenticated NFS share since all the systems are on a private network anyway. It can be made to go with sshfs, at least according to my preliminary tests, but it's extra setup work and obviously potential fragility. I'm not sure if this is the only remaining case where the worker process may use the shared filesystem or if there are others (please point me at any others you know about!), but it's certainly one of them.

Aside from that, specifically in this case it effectively doubles the network traffic required for an asset transfer, because the worker uploads the asset to the server and then checksums the uploaded file, which effectively means it re-downloads the file via whatever the shared filesystem protocol is (assuming the server is hosting the share). So if you're uploading a 2GB asset you do 2GB of transfer from the worker to the server, then 2GB of transfer back from the server to the worker. Which is obviously inefficient and, depending on the scenario, potentially expensive.

The problem of the server potentially getting overloaded with checksum work is a real one, though, and I hadn't thought about it till coolo pointed it out...

#6 Updated by szarate almost 4 years ago

I think this is almost the only remaining use of NFS, but the fix would have to be a generic checksum check for most files, and/or some sort of verification of the upload while it is in progress.

The worker can simply calculate the checksum and send it to the webUI along with the progress.
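A minimal sketch of that idea, assuming SHA-256 as the checksum (the thread does not name an algorithm) and a hypothetical `build_upload_report` payload. Neither the function names nor the payload fields come from openQA; they only illustrate a worker hashing its local copy of the file and sending the result along with the progress.

```python
import hashlib


def file_sha256(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read block by block so a multi-GB asset never has to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_upload_report(path, bytes_sent):
    """Hypothetical status payload: upload progress plus the locally computed checksum."""
    return {"file": path, "bytes_sent": bytes_sent, "sha256": file_sha256(path)}
```

Because the worker hashes its local copy, nothing has to be read back over a shared filesystem; the receiving side still has to hash what it received in order to compare.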

#7 Updated by coolo almost 4 years ago

But who verifies the upload? We can of course also validate it on download into the cache - by every worker.

And no, it's not the only case. We also have use cases of 'other' assets delivered from share/openqa, which are used e.g. in autoyast installations. But we can find a way without them later.

#8 Updated by AdamWill almost 4 years ago

I sorta felt like there must be some sort of standard way to do a 'verified' upload over HTTP (where both ends check chunks as they are transferred, or something like that), but at least from a cursory poke around Teh Intarwebz this morning, I couldn't find one.

Are 'other' assets not cached, on workers with caching enabled?

BTW, the other big roadblock I came up on was tests/needles; yes, there is caching for these, but it requires rsync (i.e. ssh), which is effectively the same as requiring a shared filesystem if you were going to use sshfs for the shared filesystem anyway. You can punch the necessary firewall holes and so on for a 'remote' worker, but it's a pain point (how much of one depends on the details of your setup, I guess).

A goofy solution I thought of for this would be to use gitfs instead of a 'traditional' shared filesystem for the distri, but a 'proper' solution might be to make it possible to define the DISTRI as a specific commit in a git repository; worker hosts would cache distri repos as they came across them, and use a temporarily-stored 'git archive' of the specified commit for each job, or something like that.

#9 Updated by szarate almost 4 years ago

coolo yeah, I keep forgetting about the 'others' assets lol... anywho:

but who verifies the upload? we can of course also validate it on download into the cache - by every worker.

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB); the worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum, tells the worker, and everybody is happy...

This could save time in case of network errors, since only the failed chunk would need to be re-sent.
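One possible reading of this proposal, sketched in memory without any HTTP layer. The names `make_chunks` and `ChunkReceiver`, the dict fields, and the use of SHA-256 are all assumptions made for illustration; this is not the implementation that later landed in openQA.

```python
import hashlib


def make_chunks(data, chunk_size):
    """Worker side: yield one dict per chunk with its own SHA-256 and a last-chunk flag."""
    total = max(1, -(-len(data) // chunk_size))  # ceiling division; empty data is one empty chunk
    for index in range(total):
        body = data[index * chunk_size:(index + 1) * chunk_size]
        yield {
            "index": index,
            "last": index == total - 1,
            "sha256": hashlib.sha256(body).hexdigest(),  # would travel as a request header
            "body": body,
        }


class ChunkReceiver:
    """Server side: accept a chunk only if its body matches the checksum sent with it."""

    def __init__(self):
        self.parts = {}
        self.complete = False

    def receive(self, chunk):
        if hashlib.sha256(chunk["body"]).hexdigest() != chunk["sha256"]:
            return False  # rejected: the worker re-sends this one chunk only
        self.parts[chunk["index"]] = chunk["body"]
        if chunk["last"]:
            self.complete = True
        return True

    def assemble(self):
        """Join the accepted chunks in index order."""
        return b"".join(self.parts[i] for i in sorted(self.parts))
```

Each request stays small, so the server only ever hashes one chunk's worth of data at a time, and a corrupted chunk costs one retransmission of that chunk rather than of the whole asset.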

Crazy idea, but who knows!

#10 Updated by szarate almost 4 years ago

This is doing something like that: https://gist.github.com/jberger/4744482

#11 Updated by okurz almost 4 years ago

szarate wrote:

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB); the worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum, tells the worker, and everybody is happy...

I wonder if this provides any benefit over letting the webui just calculate the checksum over the whole large file once the upload is complete. Leave the chunking responsibility to an efficient checksumming algorithm rather than trying to roll your own.
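For context on the "efficient checksumming algorithm" point: common hash implementations are already incremental, so a whole-file digest can be built up piece by piece and still match a one-shot digest of the same bytes. A small sketch using Python's `hashlib`, with SHA-256 as an assumed algorithm and `digest_of_stream` as a hypothetical helper name:

```python
import hashlib


def digest_of_stream(chunks):
    """Feed byte chunks into one hash object in order and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

The result is independent of how the data is split, so the whole-file checksum does not require any application-level chunk bookkeeping; the trade-off raised elsewhere in the thread is how long a single request keeps the server busy.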

#12 Updated by coolo almost 4 years ago

Yeah, because mojolicious is nonblocking - your requests are supposed to be easy to digest.

#13 Updated by szarate over 3 years ago

#14 Updated by szarate over 3 years ago

  • Blocks action #27955: Allow the worker_bridge to sync job status from a slaveUI to a masterUI added

#15 Updated by szarate over 3 years ago

  • Assignee set to EDiGiacinto
  • Target version changed from Ready to Current Sprint

Moving this task to current sprint, since it's part of poo#27955

#16 Updated by EDiGiacinto over 3 years ago

  • Status changed from New to In Progress

The proposal is here: https://github.com/os-autoinst/openQA/pull/1564 - it is currently being tested.
A separate channel for downloading/uploading chunked files with OpenQA::Client will follow as a second step, in a different PR.

#17 Updated by szarate over 3 years ago

  • Status changed from In Progress to Resolved

This has been deployed and is solved... only the "Other" assets case is still pending, if it arises.

#18 Updated by szarate over 3 years ago

  • Target version changed from Current Sprint to Done

#19 Updated by coolo over 3 years ago

  • Has duplicate action #32284: Test incompletes because the publishing of the HDD fails added
