action #17788: [tools]Uploading images chksum check relies on global /var/lib/openqa/share - openQA Project (public) - openSUSE Project Management Tool

closed

Added by coolo about 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: EDiGiacinto
Category: Regressions/Crashes
Target version: Done
Start date: 2017-03-19

Description

The webui just defines the path and the worker calls cksum on it - so this is broken for multiple webuis with or without NFS.

We did that to keep the webui from becoming a bottleneck calculating checksums, but we may have to bite the bullet and have it calculate checksums of uploaded chunks, so it's not blocking for long.

Related issues 3 (0 open — 3 closed)

Updated by RBrownSUSE about 8 years ago

Subject changed from Uploading images chksum check relies on global /var/lib/openqa/share to [tools]Uploading images chksum check relies on global /var/lib/openqa/share

Updated by okurz about 8 years ago

Category set to Regressions/Crashes

Updated by coolo over 7 years ago

Target version set to Ready

The prio of this is a bit higher than normal, but not High :)

Updated by szarate over 7 years ago

I'm +1 on getting this one fixed. It's quite a pain, especially when someone is trying to come up with a Docker image and forgets about the asset uploading part :)

Updated by AdamWill over 7 years ago

For the record, I recently talked to coolo about this because I'm looking into the viability of hosting workers "further away" in network terms from the server (i.e. In The Cloud). This setup is a bit of a problem for that, in a few ways.

It'd be much easier to do this in general of course if you didn't have to worry about setting up the shared filesystem on worker hosts at all. Needing it at all means you have to figure out a way to make it fly when you can't just stick up an unauthenticated NFS share since all the systems are on a private network anyway. It can be made to go with sshfs, at least according to my preliminary tests, but it's extra setup work and obviously potential fragility. I'm not sure if this is the only remaining case where the worker process may use the shared filesystem or if there are others (please point me at any others you know about!), but it's certainly one of them.

Aside from that, specifically in this case it effectively doubles the network traffic required for an asset transfer, because the worker uploads the asset to the server and then checksums the uploaded file, which effectively means it re-downloads the file via whatever the shared filesystem protocol is (assuming the server is hosting the share). So if you're uploading a 2GB asset you do 2GB of transfer from the worker to the server, then 2GB of transfer back from the server to the worker. Which is obviously inefficient and, depending on the scenario, potentially expensive.

The problem of the server potentially getting overloaded with checksum work is a real one, though, and I hadn't thought about it till coolo pointed it out...

Updated by szarate over 7 years ago

I think this is almost the only remaining use of NFS, but the fix would have to be a generic checksum check for most files, and/or some sort of verification of the upload while it is in progress.

The worker can simply calculate the checksum and send it to the webUI along with the progress.
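
A minimal sketch of the worker-side half of this idea, in Python purely for illustration (openQA itself is written in Perl). It hashes a local file block by block, so memory use stays constant no matter how large the asset is. SHA-256, the `file_checksum` name and the block size are assumptions made for the example; the check described in this ticket used `cksum`.

```python
import hashlib


def file_checksum(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of the file at `path`, read block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The worker would then send the resulting hex string along with the upload, so nothing has to be read back over a shared filesystem.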

Updated by coolo over 7 years ago

But who verifies the upload? We can of course also validate it on download into the cache, by every worker.

And no, it's not the only case. We also have use cases where 'other' assets are delivered from share/openqa and used e.g. in autoyast installations. But we can find a way without them later.
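
Validating on download into the cache could look roughly like the sketch below (Python for illustration, not openQA code). It hashes the asset while copying it into the cache and discards the file on a mismatch. The `download_to_cache` name, its parameters and the use of SHA-256 are all assumptions, as is the idea that the expected checksum is already known to the worker.

```python
import hashlib
import os
from typing import BinaryIO


def download_to_cache(source: BinaryIO, cache_path: str, expected_sha256: str,
                      block_size: int = 1024 * 1024) -> bool:
    """Copy `source` to `cache_path`, hashing as it goes; keep the file only if it matches."""
    digest = hashlib.sha256()
    with open(cache_path, "wb") as cached:
        for block in iter(lambda: source.read(block_size), b""):
            digest.update(block)
            cached.write(block)
    if digest.hexdigest() != expected_sha256:
        os.remove(cache_path)  # never leave a corrupt asset in the cache
        return False
    return True
```

The cost is that every worker repeats the verification, which is the trade-off pointed out above.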

Updated by AdamWill over 7 years ago

I sorta felt like there must be some sort of standard way to do a 'verified' upload over HTTP (where both ends check chunks as they are transferred, or something like that), but at least from a cursory poke around Teh Intarwebz this morning, I couldn't find one.

Are 'other' assets not cached, on workers with caching enabled?

BTW, the other big roadblock I ran into was tests/needles; yes, there is caching for these, but it requires rsync (i.e. ssh), which is effectively the same as requiring a shared filesystem if you were going to use sshfs for the shared filesystem anyway. You can punch the necessary firewall holes and so on for a 'remote' worker, but it's a pain point (how much of one depends on the details of your setup, I guess).

A goofy solution I thought of for this would be to use gitfs instead of a 'traditional' shared filesystem for the distri, but a 'proper' solution might be to make it possible to define the DISTRI as a specific commit in a git repository; worker hosts would cache distri repos as they came across them, and use a temporarily-stored 'git archive' of the specified commit for each job, or something like that.

Updated by szarate over 7 years ago

@coolo yeah, I keep forgetting about the 'others' assets lol... anywho:

coolo wrote:

But who verifies the upload? We can of course also validate it on download into the cache, by every worker.

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB). The worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum and tells the worker, and everybody is happy...

This could also save time in case of network errors.

Crazy idea, but who knows!
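
A rough sketch of that chunking idea, in Python for illustration only. The `Chunk`, `split_into_chunks` and `UploadReceiver` names are hypothetical, SHA-256 is an assumed choice, and verifying every chunk on arrival is just one possible reading of the proposal (the comment itself only mentions verification once the last chunk is marked). In a real transfer the index, total and checksum would travel as request headers.

```python
import hashlib
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    index: int       # 1-based position of this chunk
    total: int       # total number of chunks in the upload
    payload: bytes
    checksum: str    # SHA-256 of payload; would be sent as a header


def split_into_chunks(data: bytes, chunk_size: int) -> Iterator[Chunk]:
    """Worker side: cut `data` into chunks and attach a checksum to each."""
    total = max(1, -(-len(data) // chunk_size))  # ceiling division
    for i in range(total):
        payload = data[i * chunk_size:(i + 1) * chunk_size]
        yield Chunk(i + 1, total, payload, hashlib.sha256(payload).hexdigest())


class UploadReceiver:
    """webUI side: verify each chunk on arrival and assemble the file."""

    def __init__(self) -> None:
        self.parts = {}  # chunk index -> verified payload

    def receive(self, chunk: Chunk) -> bool:
        """Return False for a corrupt chunk so the worker can resend just that one."""
        if hashlib.sha256(chunk.payload).hexdigest() != chunk.checksum:
            return False
        self.parts[chunk.index] = chunk.payload
        return True

    def is_complete(self, total: int) -> bool:
        return len(self.parts) == total

    def assemble(self) -> bytes:
        return b"".join(self.parts[i] for i in sorted(self.parts))
```

With this shape, a network error costs one chunk's worth of retransmission, and each request carries only a small, quickly hashed payload.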

#10

Updated by szarate over 7 years ago

This is doing something like that: https://gist.github.com/jberger/4744482

#11

Updated by okurz over 7 years ago

szarate wrote:

The webUI. We can split a large file upload into small chunks of configurable size (say 10MB/100MB). The worker calculates a checksum for each chunk and adds it as a header to the transaction. Once the worker marks a chunk as the last one, the webUI verifies the checksum and tells the worker, and everybody is happy...

I wonder if this provides any benefit over letting the webui just calculate the checksum over the whole large file once the upload is complete. Leave the chunking to an efficient checksumming algorithm rather than trying to roll your own.

#12

Updated by coolo over 7 years ago

Yes, it does, because Mojolicious is nonblocking: your requests are supposed to be easy to digest.
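
To illustrate the concern in general terms, here is a small sketch with Python's asyncio standing in for a single-threaded event loop (an analogy only, not Mojolicious or openQA code). Hashing in small pieces and yielding between them lets other pending work run, whereas hashing a whole multi-gigabyte file in one call would hold everything up until it finished. All names are made up for the example.

```python
import asyncio
import hashlib


async def hash_in_chunks(data: bytes, chunk_size: int) -> str:
    """Hash `data` piece by piece, yielding to the event loop after each piece."""
    digest = hashlib.sha256()
    for offset in range(0, len(data), chunk_size):
        digest.update(data[offset:offset + chunk_size])
        await asyncio.sleep(0)  # hand control back so other requests can proceed
    return digest.hexdigest()


async def demo(data: bytes, chunk_size: int):
    """Run the hash next to a 'heartbeat' task and count how often the latter got to run."""
    ticks = 0
    done = False

    async def heartbeat() -> None:
        nonlocal ticks
        while not done:
            ticks += 1
            await asyncio.sleep(0)

    beat = asyncio.create_task(heartbeat())
    result = await hash_in_chunks(data, chunk_size)
    done = True
    await beat
    return result, ticks
```

Calling `asyncio.run(demo(payload, 1024))` returns the digest together with a nonzero tick count, showing that the heartbeat was served while the hash was still in progress.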

#13

Updated by szarate over 7 years ago

Related to action #30352: Disk upload failures added

#14

Updated by szarate over 7 years ago

Blocks action #27955: Allow the worker_bridge to sync job status from a slaveUI to a masterUI added

#15

Updated by szarate over 7 years ago

Assignee set to EDiGiacinto
Target version changed from Ready to Current Sprint

Moving this task to current sprint, since it's part of poo#27955

#16

Updated by EDiGiacinto over 7 years ago

Status changed from New to In Progress

Proposal is here: https://github.com/os-autoinst/openQA/pull/1564 - it is currently being tested.
A separate channel for downloading/uploading chunked files with OpenQA::Client will follow as a second step, in a different PR.

#17

Updated by szarate over 7 years ago

Status changed from In Progress to Resolved

This has been deployed and is solved. Only the "Other" assets case is still pending, should it come up.

#18

Updated by szarate over 7 years ago

Target version changed from Current Sprint to Done

#19

Updated by coolo about 7 years ago

Has duplicate action #32284: Test incompletes because the publishing of the HDD fails added
