tickets #124802

RFC: continue to permit robots access to Redmine?

Added by pjessen about 1 year ago. Updated 4 months ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2023-02-20
Due date:
% Done: 0%
Estimated time:

Description

While I was researching another issue, I couldn't help noticing that robots access Redmine too, subject to their compliance with robots.txt, of course. Do we really want that? I mean, although some issues might be public, do we really want them indexed everywhere?

Actions #1

Updated by pjessen about 1 year ago

  • Private changed from Yes to No
Actions #2

Updated by luc14n0 about 1 year ago

Just some questions, so I can catch up.

Do we know what those robots are doing exactly?
You mentioned indexing: are those robots indexing the tickets somewhere, then?
And if so, do we know where?

Actions #3

Updated by pjessen about 1 year ago

luc14n0 wrote:

Just some questions, so I can catch up.
Do we know what those robots are doing exactly?

They are retrieving various pages, searches, issues and such. It is all in the nginx logs. Here is what Applebot looked at yesterday:

192.168.47.21 - - [21/Feb/2023:03:02:58 +0000] "GET /robots.txt HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:02:59 +0000] "GET /issues/88780 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:26:39 +0000] "GET /issues/90041 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:31:44 +0000] "GET /users/32201 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:04:32:41 +0000] "GET /issues/20 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:06:09:12 +0000] "GET /issues/97046 HTTP/1.1"

You mentioned indexing: are those robots indexing the tickets somewhere, then?

Yes, almost certainly; that's their primary function: gather information and index it. Think Google.

And if so, do we know where?

These are the bots I see in February, in no particular order:

SemrushBot
Slackbot
DotBot
bingbot
YandexBot
Googlebot
AhrefsBot
Amazonbot
Slack-ImgProxy
PetalBot
Applebot
Qwantify
SiteAuditBot
YaK
CCBot
SeznamBot
Synapse
Twitterbot
SummalyBot
linkdexbot
DuckDuckGo
serpstatbot
BitSightBot
DataForSeoBot
MojeekBot
LinkedInBot
RepoLookoutBot
Mail.RU_Bot
YandexFavicons
GozleBot
MJ12bot
coccocbot
Discu.eu
yacybot
TelegramBot
Sogou
YisouSpider

Some are well known, many a lot less so. It's the usual "crowd".

Note: we already have a long list of areas that are "blocked off" in robots.txt (https://progress.opensuse.org/robots.txt), so compliant robots will not go there.

Traffic-wise, December 2022:

* 12463658 GET requests
* 11070582 by "python-requests", essentially some unidentified robot.
* 764984   by the robots above.

On average, we are serving 4.5 requests per second; that is quite impressive.
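
For a rough sanity check on that figure: 12463658 GET requests over the 31 days of December works out to 12463658 / (31 * 86400) ≈ 4.7 per second, so the average quoted above is in the right ballpark.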

Actions #4

Updated by pjessen about 1 year ago

  • Status changed from New to Feedback

I am a little bit surprised there has been so little feedback here. Maybe that is because I neglected to add a specific proposal.

  • I propose we block all robots. I simply do not see what purpose any of them serve in this context (see the sketch below).
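
For the compliant crawlers, blocking everything would only take a minimal robots.txt. This is a sketch of standard robots.txt syntax, not the current contents of https://progress.opensuse.org/robots.txt:

# disallow all compliant crawlers from the entire site
User-agent: *
Disallow: /

Non-compliant robots would of course ignore it.
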
Actions #5

Updated by crameleon about 1 year ago

There are good robots as well. Search engine indexing in particular can be useful for people outside of the openSUSE community. I would assume bad robots are likely to not respect robots.txt anyway.

Actions #6

Updated by pjessen about 1 year ago

crameleon wrote:

There are good robots as well.

Certainly. I expect the vast majority are "good".

Search engine indexing in particular can be useful for people outside of the openSUSE community.

Maybe, but then what is their business with our ticketing system ...

I would assume bad robots are likely to not respect robots.txt anyway.

Ditto.

Actions #7

Updated by crameleon 4 months ago

Maybe, but then what is their business with our ticketing system

We are also tracking infrastructure issues which could potentially be relevant to the wider community.

I think it would be good to make a curated list of robots which do not map to known search engines and serve it via an HAProxy rule for all our websites, not just Progress. Your list above seems to be a good start, minus Googlebot and yacybot.
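
A minimal sketch of what such an HAProxy rule could look like, assuming the curated names are kept as User-Agent substrings in a file such as /etc/haproxy/bot-blacklist.lst (the frontend name and file path here are only placeholders):

frontend www
    # load the curated bot names, one substring per line, and match them
    # case-insensitively against the User-Agent header
    acl blocked_bot hdr_sub(User-Agent) -i -f /etc/haproxy/bot-blacklist.lst
    # refuse matching requests outright
    http-request deny deny_status 403 if blocked_bot

Keeping the names in a plain file also makes it easy to maintain a single list and reuse it for every site behind the same HAProxy.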

Actions #8

Updated by pjessen 4 months ago

crameleon wrote in #note-7:

I think it would be good to make a curated list of robots which do not map to known search engines and serve it via an HAProxy rule for all our websites, not just Progress. Your list above seems to be a good start, minus Googlebot and yacybot.

I don't know what yacybot is, but if we want to include regular search engines, bingbot and duckduckgo are probably candidates too.

Actions #9

Updated by crameleon 4 months ago

YaCy is an open, federated search engine.

Do you want to make it a black- or a whitelist? I was thinking of a blacklist, and just leaving such search engines out.

Actions #10

Updated by pjessen 4 months ago

crameleon wrote in #note-9:

YaCy is an open, federated search engine.

Do you want to make it a black- or a whitelist? I was thinking of a blacklist, and just leaving such search engines out.

I think we have to blacklist; a whitelist also means whitelisting a gazillion browser signatures, doesn't it? Even if by regex, we will still likely miss some, only creating more work for ourselves.

Actions #11

Updated by crameleon 4 months ago

I concur.
