tickets #103773

closed

superfluous mirror scanning ? (apparently by mirrorcache)

Added by pjessen over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Mirrors
Target version: -
Start date: 2021-12-09
Due date:
% Done: 100%
Estimated time:

Description

I have just been adding some more disks to our openSUSE mirror (http://mirror.hostsuisse.com/opensuse) and I happened to notice a lot of accesses from 195.135.221.151, aka scar.o.o. The User-Agent is "Mojolicious (Perl)", which apparently suggests this is being done by mirrorcache?

We don't mirror repositories (too big), but we are still being bombarded by requests (that all get a 404): in the last 30 days, 1'471'058 requests, approx 50'000 per day. That can't be right?
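For anyone who wants to check their own mirror for the same pattern, here is a minimal sketch that counts requests per status code and User-Agent for one client IP. It assumes the web server writes the common "combined" access log format; the sample lines, including the request paths and timestamps, are made up for illustration and are not taken from the real log.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_requests(lines, ip):
    """Count requests from one client IP, keyed by (status, user agent)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("ip") == ip:
            counts[(match.group("status"), match.group("agent"))] += 1
    return counts

# Made-up sample lines, only to show the shape of the input.
sample = [
    '195.135.221.151 - - [09/Dec/2021:10:00:00 +0100] '
    '"HEAD /opensuse/repositories/ HTTP/1.1" 404 0 "-" "Mojolicious (Perl)"',
    '195.135.221.151 - - [09/Dec/2021:10:00:02 +0100] '
    '"GET /opensuse/repositories/Apache/ HTTP/1.1" 404 0 "-" "Mojolicious (Perl)"',
    '203.0.113.7 - - [09/Dec/2021:10:00:03 +0100] '
    '"GET /opensuse/tumbleweed/ HTTP/1.1" 200 512 "-" "curl/7.79.1"',
]

scanner_counts = count_requests(sample, "195.135.221.151")
for (status, agent), n in scanner_counts.items():
    print(status, agent, n)
```

On a real mirror, the `sample` list would be replaced by the lines of the access log file.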

Also, why only over IPv4?

olaf only uses IPv6 for mirror.hostsuisse.com, but made only 45'934 accesses in the same period of time, i.e. about 30 times fewer :-)
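The figures above can be checked with a few lines of arithmetic. The numbers are the ones quoted in this ticket; the variable names are only for illustration.

```python
scanner_requests = 1_471_058  # requests from 195.135.221.151 in 30 days
olaf_requests = 45_934        # accesses from olaf in the same period
days = 30

per_day = scanner_requests / days
ratio = scanner_requests / olaf_requests

print(f"{per_day:.0f} requests per day")  # 49035 requests per day
print(f"{ratio:.1f} times as many")       # 32.0 times as many
```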

Actions #1

Updated by pjessen over 2 years ago

  • Private changed from Yes to No

Wrt the User-Agent, perhaps it might make sense to clearly identify as mirrorcache? The mirrorbrain scanner uses:

MirrorBrain Probe (see http://mirrorbrain.org/probe_info)

Something like that might be better.
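To illustrate the idea, here is a minimal Python sketch of a probe request that announces itself. MirrorCache itself is written in Perl, so this is not its code; the User-Agent string, the `build_probe` helper and the choice of a HEAD request are all assumptions made for the example. No request is actually sent.

```python
from urllib.request import Request

# Hypothetical identifier; the real string is for the MirrorCache project to choose.
SCANNER_USER_AGENT = "MirrorCache Scanner (hypothetical example string)"

def build_probe(url: str) -> Request:
    """Build (but do not send) a HEAD request with a descriptive User-Agent."""
    return Request(url, method="HEAD", headers={"User-Agent": SCANNER_USER_AGENT})

probe = build_probe("http://mirror.hostsuisse.com/opensuse/")
# urllib stores header names capitalized as "User-agent".
print(probe.get_method(), probe.get_header("User-agent"))
```

With a string like this in the access log, a mirror admin can tell at a glance which service is scanning.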

Actions #2

Updated by andriinikitin over 2 years ago

  • Category set to Mirrors
  • Status changed from New to In Progress
  • % Done changed from 0 to 90

I agree that the claim is valid and the number of requests can be optimized.

It has been addressed in MirrorCache 1.021, starting now: MirrorCache mirror_scan jobs will not attempt to scan individual folders on mirrors that do not have the root folder of a project, as defined in
https://github.com/openSUSE/MirrorCache/blob/master/dist/salt/profile/mirrorcache/files/usr/share/mirrorcache/sql/projects.sql

It will still try to check whether a mirror has a project (e.g. /repositories) once every few minutes (this can be reduced further, but I am not sure if it is necessary).
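The filtering described above can be sketched roughly as follows. This is a simplified Python illustration and not the actual MirrorCache code (which is Perl); the `folders_to_scan` function, the `has_path` probe callback and the example paths are hypothetical.

```python
from typing import Callable, Iterable, List

def folders_to_scan(
    has_path: Callable[[str], bool],
    project_roots: Iterable[str],
    folders: Iterable[str],
) -> List[str]:
    """Keep only folders whose project root exists on the mirror."""
    # Probe each project root once; a missing root rules out everything below it.
    present = [root.rstrip("/") for root in project_roots if has_path(root)]
    return [
        folder
        for folder in folders
        if any(folder == root or folder.startswith(root + "/") for root in present)
    ]

# Made-up mirror layout: carries /distribution and /tumbleweed, not /repositories.
available = {"/distribution", "/tumbleweed"}
roots = ["/distribution", "/tumbleweed", "/repositories"]
wanted = [
    "/distribution/leap/15.3/repo/oss",
    "/repositories/Apache/15.3",
    "/tumbleweed/repo/oss",
]

result = folders_to_scan(lambda path: path in available, roots, wanted)
print(result)  # ['/distribution/leap/15.3/repo/oss', '/tumbleweed/repo/oss']
```

In this sketch the mirror still sees one probe per project root, but nothing below a missing root is requested.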

You may still expect a larger number of requests from MirrorCache compared to MirrorBrain, because it uses a different approach: instead of doing tree scans (full or partial) on each mirror, a job scans individual folders (without subfolders) on all mirrors. This approach has some disadvantages, but I believe the advantages outweigh them:

  • it is easier to react promptly to new releases of OBS projects;
  • it is easier to diagnose and retry particular scans;
  • it allows tracking only those locations which are actually in use. E.g. if users use /repositories/Apache/ only on TW and 15.3, then other locations like SLE_15, SLE_15.1 etc. will not be tracked by the redirector (until some user actually starts using them).

So a mirror should expect more reads from MirrorCache than from MirrorBrain, but those reads will be for a single folder only (instead of a recursive scan) and only for those folders which were requested by users in the past 2 weeks (instead of all).
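As a rough picture of that scheduling model, the sketch below builds a list of non-recursive (folder, mirror) scans from the folders requested within the last 2 weeks. It is a Python illustration under stated assumptions, not MirrorCache's implementation; the `scan_plan` function, the data layout and the second mirror name are hypothetical.

```python
from datetime import datetime, timedelta

def scan_plan(last_requested, mirrors, now, window=timedelta(days=14)):
    """Return one single-folder scan per (recently used folder, mirror) pair."""
    recent = sorted(
        folder for folder, seen in last_requested.items() if now - seen <= window
    )
    return [(folder, mirror) for folder in recent for mirror in mirrors]

now = datetime(2021, 12, 9)
# Made-up usage data: folder -> time of the most recent user request.
last_requested = {
    "/repositories/Apache/openSUSE_Tumbleweed": datetime(2021, 12, 8),
    "/repositories/Apache/15.3": datetime(2021, 12, 1),
    "/repositories/Apache/SLE_15": datetime(2021, 10, 1),  # stale, not scanned
}
mirrors = ["mirror.hostsuisse.com", "mirror.example.org"]

plan = scan_plan(last_requested, mirrors, now)
for folder, mirror in plan:
    print(mirror, folder)
```

Here the stale SLE_15 folder produces no requests at all, while each recently used folder costs one read per mirror.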

I will try to address the other questions next week (http vs https load on the mirror and more descriptive user-agent hint).

Actions #3

Updated by andriinikitin over 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

> Also, why only over IPv4?

Hard to tell. Both mirrorcache.o.o and mirrorcache-eu.o.o can use only IPv4 addresses, but if a mirror uses IPv6, requests should still be redirected to that mirror properly. I could use some external help setting up the config for the machines if you think that IPv6 is better for scanning.

> Wrt the User-Agent, perhaps it might make sense to clearly identify as mirrorcache?

That's actually a good idea. This PR should explicitly set the User-Agent for MirrorCache; it should be deployed on Thursday: https://github.com/openSUSE/MirrorCache/pull/240
