tickets #124802

RFC: continue to permit robots access to Redmine?

Added by pjessen about 1 year ago. Updated 4 months ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 2023-02-20
Due date:
% Done: 0%
Estimated time:

Description

While I was researching another issue, I couldn't help noticing that robots access Redmine too, subject to their compliance with robots.txt, of course. Do we really want that? I mean, although some issues might be public, do we really want them indexed everywhere?

Actions #1

Updated by pjessen about 1 year ago

  • Private changed from Yes to No
Actions #2

Updated by luc14n0 about 1 year ago

Just some questions, so I can catch up.

Do we know what those robots are doing exactly?
You mentioned indexing: are those robots indexing the tickets somewhere, then?
And if so, do we know where?

Actions #3

Updated by pjessen about 1 year ago

luc14n0 wrote:

Just some questions, so I can catch up.
Do we know what those robots are doing exactly?

They are retrieving various pages, searches, issues and such. It is all in the nginx logs. Here is what Applebot looked at yesterday:

192.168.47.21 - - [21/Feb/2023:03:02:58 +0000] "GET /robots.txt HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:02:59 +0000] "GET /issues/88780 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:26:39 +0000] "GET /issues/90041 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:03:31:44 +0000] "GET /users/32201 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:04:32:41 +0000] "GET /issues/20 HTTP/1.1"
192.168.47.21 - - [21/Feb/2023:06:09:12 +0000] "GET /issues/97046 HTTP/1.1"

You mentioned indexing: are those robots indexing the tickets somewhere, then?

Yes, almost certainly; that's their primary function: gather information and index it. Think Google.

And if so, do we know where?

These are the bots I see in February, in no particular order:

SemrushBot
Slackbot
DotBot
bingbot
YandexBot
Googlebot
AhrefsBot
Amazonbot
Slack-ImgProxy
PetalBot
Applebot
Qwantify
SiteAuditBot
YaK
CCBot
SeznamBot
Synapse
Twitterbot
SummalyBot
linkdexbot
DuckDuckGo
serpstatbot
BitSightBot
DataForSeoBot
MojeekBot
LinkedInBot
RepoLookoutBot
Mail.RU_Bot
YandexFavicons
GozleBot
MJ12bot
coccocbot
Discu.eu
yacybot
TelegramBot
Sogou
YisouSpider

Some are well known, many a lot less so. It's the usual "crowd".

Note: we already have a long list of areas that are "blocked off" in robots.txt (https://progress.opensuse.org/robots.txt), so compliant robots will not go there.

Traffic-wise, December 2022:

* 12463658 GET requests
* 11070582 by "python-requests", essentially some unidentified robot.
* 764984   by the robots above.

On average, we are serving 4.5 requests per second; that is quite impressive.
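
For a rough sanity check on that figure: 12463658 GET requests over the 31 days of December works out to 12463658 / (31 * 86400) ≈ 4.7 per second, so the average quoted above is in the right ballpark.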

Actions #4

Updated by pjessen about 1 year ago

  • Status changed from New to Feedback

I am a little bit surprised there has been so little feedback here. Maybe that is because I neglected to add a specific proposal.

  • I propose we block all robots. I simply do not see what purpose any of them serve in this context (see the sketch below).
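
For the compliant crawlers, blocking everything would only take a minimal robots.txt. This is a sketch of standard robots.txt syntax, not the current contents of https://progress.opensuse.org/robots.txt:

# disallow all compliant crawlers from the entire site
User-agent: *
Disallow: /

Non-compliant robots would of course ignore it.
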
Actions #5

Updated by crameleon about 1 year ago

There are good robots as well. Search engine indexing in particular can be useful for people outside of the openSUSE community. I would assume bad robots are likely to not respect robots.txt anyway.

Actions #6

Updated by pjessen about 1 year ago

crameleon wrote:

There are good robots as well.

Certainly. I expect the vast majority are "good".

Search engine indexing in particular can be useful for people outside of the openSUSE community.

Maybe, but then what is their business with our ticketing system ...

I would assume bad robots are likely to not respect robots.txt anyway.

Ditto.

Actions #7

Updated by crameleon 4 months ago

Maybe, but then what is their business with our ticketing system

We are also tracking infrastructure issues which could potentially be relevant to the wider community.

I think it would be good to make a curated list of robots which do not map to known search engines and serve it via an HAProxy rule for all our websites, not just Progress. Your list above seems to be a good start, minus Googlebot and yacybot.
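
A minimal sketch of what such an HAProxy rule could look like, assuming the curated names are kept as User-Agent substrings in a file such as /etc/haproxy/bot-blacklist.lst (the frontend name and file path here are only placeholders):

frontend www
    # load the curated bot names, one substring per line, and match them
    # case-insensitively against the User-Agent header
    acl blocked_bot hdr_sub(User-Agent) -i -f /etc/haproxy/bot-blacklist.lst
    # refuse matching requests outright
    http-request deny deny_status 403 if blocked_bot

Keeping the names in a plain file also makes it easy to maintain a single list and reuse it for every site behind the same HAProxy.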

Actions #8

Updated by pjessen 4 months ago

crameleon wrote in #note-7:

I think it would be good to make a curated list of robots which do not map to known search engines and serve it via an HAProxy rule for all our websites, not just Progress. Your list above seems to be a good start, minus Googlebot and yacybot.

I don't know what yacybot is, but if we want to include regular search engines, bingbot and duckduckgo are probably candidates too.

Actions #9

Updated by crameleon 4 months ago

YaCy is an open, federated search engine.

Do you want to make it a black- or a whitelist? I was thinking of a blacklist, and just leaving such search engines out.

Actions #10

Updated by pjessen 4 months ago

crameleon wrote in #note-9:

YaCy is an open, federated search engine.

Do you want to make it a black- or a whitelist? I was thinking of a blacklist, and just leaving such search engines out.

I think we have to blacklist; a whitelist also means whitelisting a gazillion browser signatures, doesn't it? Even if by regex, we will still likely miss some, only creating more work for ourselves.

Actions #11

Updated by crameleon 4 months ago

I concur.
