action #17788

[tools]Uploading images chksum check relies on global /var/lib/openqa/share

Added by coolo almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Start date: 19/03/2017
Priority: Normal
Due date:
Assignee: EDiGiacinto
% Done: 0%
Category: Concrete Bugs
Target version: Done
Difficulty:
Duration:

Description

The webui just defines the path and the worker calls cksum on it - so this is broken for multiple webuis with or without NFS.

We did that to avoid the webui becoming a bottleneck calculating chksums, but possibly we have to bite the bullet and have it calculate checksums of uploaded chunks, so it's not blocking for long.


Related issues

Related to openQA Project - action #30352: Disk upload failures Resolved 15/01/2018
Duplicated by openQA Project - action #32284: Test incompletes because the publishing of the HDD fails Resolved 26/02/2018
Blocks openQA Project - action #27955: Allow the worker_bridge to sync job status from a slaveUI... Rejected 20/11/2017

History

#1 Updated by RBrownSUSE almost 3 years ago

  • Subject changed from Uploading images chksum check relies on global /var/lib/openqa/share to [tools]Uploading images chksum check relies on global /var/lib/openqa/share

#2 Updated by okurz almost 3 years ago

  • Category set to Concrete Bugs

#3 Updated by coolo over 2 years ago

  • Target version set to Ready

The prio of this is a bit higher than normal, but not High :)

#4 Updated by szarate over 2 years ago

I'm +1 to getting this one fixed. It's quite a pain, especially when someone is trying to come up with a docker image and forgets about the asset uploading part :)

#5 Updated by AdamWill about 2 years ago

For the record, I recently talked to coolo about this because I'm looking into the viability of hosting workers "further away" in network terms from the server (i.e. In The Cloud). This setup is a bit of a problem for that, in a few ways.

It'd be much easier to do this in general of course if you didn't have to worry about setting up the shared filesystem on worker hosts at all. Needing it at all means you have to figure out a way to make it fly when you can't just stick up an unauthenticated NFS share since all the systems are on a private network anyway. It can be made to go with sshfs, at least according to my preliminary tests, but it's extra setup work and obviously potential fragility. I'm not sure if this is the only remaining case where the worker process may use the shared filesystem or if there are others (please point me at any others you know about!), but it's certainly one of them.

Aside from that, specifically in this case it effectively doubles the network traffic required for an asset transfer, because the worker uploads the asset to the server and then checksums the uploaded file, which effectively means it re-downloads the file via whatever the shared filesystem protocol is (assuming the server is hosting the share). So if you're uploading a 2GB asset you do 2GB of transfer from the worker to the server, then 2GB of transfer back from the server to the worker. Which is obviously inefficient and, depending on the scenario, potentially expensive.

The problem of the server potentially getting overloaded with checksum work is a real one, though, and I hadn't thought about it till coolo pointed it out...

#6 Updated by szarate about 2 years ago

I think this is almost the only remaining case of NFS use, but the fix would have to be a generic checksum check for most files, and/or some sort of verification of the upload while it is in progress.

The worker can simply calculate the checksum and send it to the webUI along with the progress.
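A minimal sketch of that idea, in Python rather than openQA's Perl. It assumes SHA-256 as the checksum (the ticket does not name an algorithm), and the function name `read_with_running_checksum` is hypothetical. The worker updates a running digest as it reads each block for upload, so the checksum is available alongside the progress count without a second pass over the file:

```python
import hashlib
import io


def read_with_running_checksum(stream, block_size=1024 * 1024):
    """Yield (block, bytes_read, hex_digest_so_far) for each block of `stream`.

    The digest is updated incrementally, so the value yielded with the
    last block is the checksum of the whole stream.
    """
    digest = hashlib.sha256()  # assumption: algorithm chosen for illustration
    bytes_read = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest.update(block)
        bytes_read += len(block)
        yield block, bytes_read, digest.hexdigest()


# Usage: the worker would send each block plus the progress/digest pair.
payload = b"x" * 2500
final_digest = None
for block, progress, running in read_with_running_checksum(
    io.BytesIO(payload), block_size=1000
):
    final_digest = running
```

This only covers the worker side; as the following comments point out, something still has to verify the digest on the receiving end.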

#7 Updated by coolo about 2 years ago

But who verifies the upload? We can of course also validate it on download into the cache, by every worker.

And no, it's not the only case. We also have use cases of 'other' assets delivered from share/openqa, which are used e.g. in autoyast installations. But we can find a way without them later.

#8 Updated by AdamWill about 2 years ago

I sorta felt like there must be some sort of standard way to do a 'verified' upload over HTTP (where both ends check chunks as they are transferred, or something like that), but at least from a cursory poke around Teh Intarwebz this morning, I couldn't find one.

Are 'other' assets not cached, on workers with caching enabled?

BTW, the other big roadblock I came up on was tests/needles; yes, there is caching for these, but it requires rsync (i.e. ssh), which is effectively the same as requiring a shared filesystem if you were going to use sshfs for the shared filesystem anyway. You can punch the necessary firewall holes and so on for a 'remote' worker, but it's a pain point (how much of one depends on the details of your setup, I guess).

A goofy solution I thought of for this would be to use gitfs instead of a 'traditional' shared filesystem for the distri, but a 'proper' solution might be to make it possible to define the DISTRI as a specific commit in a git repository; worker hosts would cache distri repos as they came across them, and use a temporarily-stored 'git archive' of the specified commit for each job, or something like that.

#9 Updated by szarate about 2 years ago

@coolo yeah, I keep forgetting about the 'others' assets lol... anywho:

coolo wrote:

But who verifies the upload? We can of course also validate it on download into the cache, by every worker.

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB). The worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum, tells the worker, and everybody is happy...

This could save time in case of network errors.
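The chunking scheme described here can be sketched roughly as follows. This is an illustration only, not the code from the eventual pull request: it is written in Python, assumes SHA-256 as the checksum, models the per-chunk header as a dictionary field, and assumes the webUI verifies every chunk on arrival. The names `split_into_chunks` and `ChunkReceiver` are hypothetical.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int):
    """Worker side: yield one dict per chunk, carrying its own checksum."""
    total = max(1, -(-len(data) // chunk_size))  # ceiling division
    for index in range(total):
        body = data[index * chunk_size:(index + 1) * chunk_size]
        yield {
            "index": index,
            "total": total,
            # stands in for the checksum header mentioned above
            "checksum": hashlib.sha256(body).hexdigest(),
            "body": body,
        }


class ChunkReceiver:
    """WebUI side: verify each chunk, then reassemble once all arrived."""

    def __init__(self):
        self.chunks = {}
        self.total = None

    def receive(self, chunk) -> bool:
        """Return False on a checksum mismatch so the worker can re-send."""
        if hashlib.sha256(chunk["body"]).hexdigest() != chunk["checksum"]:
            return False
        self.chunks[chunk["index"]] = chunk["body"]
        self.total = chunk["total"]
        return True

    def complete(self) -> bool:
        return self.total is not None and len(self.chunks) == self.total

    def assemble(self) -> bytes:
        if not self.complete():
            raise ValueError("upload is missing chunks")
        return b"".join(self.chunks[i] for i in range(self.total))
```

With this shape, a corrupted or dropped chunk costs one chunk-sized retransmission instead of the whole asset, and each request the webUI handles stays small.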

Crazy idea, but who knows!

#10 Updated by szarate about 2 years ago

This is doing something like that: https://gist.github.com/jberger/4744482

#11 Updated by okurz about 2 years ago

szarate wrote:

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB). The worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum, tells the worker, and everybody is happy...

I wonder if this provides any benefit over letting the webui just calculate the checksum over the whole large file once the upload is complete. Leave the chunking responsibility to an efficient checksumming algorithm rather than trying to roll your own.
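For comparison, a sketch of the whole-file alternative: the receiving side hashes the completed file in fixed-size blocks, so memory use stays bounded and the result does not depend on the block size. It is in Python, assumes SHA-256, and `file_checksum` is a hypothetical name.

```python
import hashlib


def file_checksum(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in `block_size` pieces, so a multi-gigabyte asset
    never has to fit in memory; the digest is the same for any block size.
    """
    digest = hashlib.sha256()  # assumption: algorithm chosen for illustration
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The catch is that this is a single long computation over the whole file, and it runs inside whichever process calls it.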

#12 Updated by coolo about 2 years ago

Yeah, because mojolicious is nonblocking - your requests are supposed to be easy to digest.

#13 Updated by szarate about 2 years ago

#14 Updated by szarate about 2 years ago

  • Blocks action #27955: Allow the worker_bridge to sync job status from a slaveUI to a masterUI added

#15 Updated by szarate about 2 years ago

  • Assignee set to EDiGiacinto
  • Target version changed from Ready to Current Sprint

Moving this task to current sprint, since it's part of poo#27955

#16 Updated by EDiGiacinto about 2 years ago

  • Status changed from New to In Progress

The proposal is here: https://github.com/os-autoinst/openQA/pull/1564 - it is currently being tested.
A separate channel for downloading/uploading chunked files with OpenQA::Client will follow in a different PR, as a second step.

#17 Updated by szarate almost 2 years ago

  • Status changed from In Progress to Resolved

This has been deployed and is solved... only "Other" assets are still pending, if the need arises.

#18 Updated by szarate almost 2 years ago

  • Target version changed from Current Sprint to Done

#19 Updated by coolo almost 2 years ago

  • Duplicated by action #32284: Test incompletes because the publishing of the HDD fails added
