tickets #159873

closed

Rework Mailman archive search

Added by crameleon 16 days ago. Updated 2 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Mailing lists
Target version: -
Start date: 2023-05-25
Due date: 2023-05-25
% Done: 0%
Estimated time:
Description

The current Xapian-based search index is very out of date, and updating the index of more than 3532690 emails seems to take an eternity.
Alternative backends are ElasticSearch and Solr. We have an ElasticSearch cluster, but it is currently specific to the Wiki. To avoid building a second one, and because Mailman is a single-node service anyway, Solr seems to be a good fit. It can run locally on mailman3.i.o.o for now.


Related issues 1 (0 open, 1 closed)

Follows openSUSE admin - tickets #129805: search results in https://lists.opensuse.org/archives/ do not show results of last two years | Resolved | crameleon | 2023-05-24

Actions #1

Updated by crameleon 16 days ago

  • Status changed from New to Workable
  • Assignee set to crameleon
  • Private changed from Yes to No

Started a package in https://build.opensuse.org/package/show/home:crameleon:misc/solr-binary. It only repackages the binary, since the Gradle plugins are nearly impossible to package offline. This is not ideal, but it works very well, and the upstream release is signed.
I will link it to o:i:mailman3 and announce when I plan the migration.

Actions #2

Updated by crameleon 16 days ago

  • Due date set to 2023-05-25
  • Start date changed from 2024-05-02 to 2023-05-25
  • Follows tickets #129805: search results in https://lists.opensuse.org/archives/ do not show results of last two years added

Actions #3

Updated by crameleon 14 days ago

  • Status changed from Workable to In Progress

Actions #4

Updated by crameleon 14 days ago · Edited

Upon indexing and inspecting the result (with Solr, it's actually fun to query the data!), I found out why the incremental indexing might never have worked: there are emails with a date in the future. On an incremental update, HyperKitty retrieves the latest indexed entry and starts indexing from that entry's date. Naturally, this will not match anything "new" as long as the index contains entries like the following:

$ curl 'http://localhost:8983/solr/mailman/query?q=*:*&q.op=OR&indent=true&rows=50&sort=archived_date%20desc&fl=id,date,archived_date,mailinglist,subject&fq=archived_date:%5BNOW%20TO%20*%5D&useParams=&omitHeader=true'

{
  "response":{"numFound":8,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"hyperkitty.email.3938940",
        "mailinglist":"users@lists.opensuse.org",
        "subject":"\nRe: [SuSE Linux] eth0 (network unreachable)\n",
        "date":"2036-02-07T18:17:25Z",
        "archived_date":"2036-02-07T18:17:25Z"},
      {
        "id":"hyperkitty.email.3337183",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nNetscape Cache-Browser ????????\n",
        "date":"2036-02-07T07:15:47Z",
        "archived_date":"2036-02-07T07:15:47Z"},
      {
        "id":"hyperkitty.email.3336060",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nRe: Netscape-Parameter\n",
        "date":"2036-02-07T07:00:15Z",
        "archived_date":"2036-02-07T07:00:15Z"},
      {
        "id":"hyperkitty.email.3333995",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nQT\n",
        "date":"2036-02-07T06:35:28Z",
        "archived_date":"2036-02-07T06:35:28Z"},
      {
        "id":"hyperkitty.email.3333992",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nNET DATE 2\n",
        "date":"2036-02-07T06:31:53Z",
        "archived_date":"2036-02-07T06:31:53Z"},
      {
        "id":"hyperkitty.email.3330989",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nRe: NET TIME unter LINUX\n",
        "date":"2036-02-07T05:50:23Z",
        "archived_date":"2036-02-07T05:50:23Z"},
      {
        "id":"hyperkitty.email.3980366",
        "mailinglist":"users@lists.opensuse.org",
        "subject":"\n[SLE] Every 2nd time\n",
        "date":"2000-07-11T07:38:58Z",
        "archived_date":"2032-06-10T12:24:13Z"},
      {
        "id":"hyperkitty.email.3438372",
        "mailinglist":"users-de@lists.opensuse.org",
        "subject":"\nScandisk f. Linux\n",
        "date":"2029-10-05T15:55:53Z",
        "archived_date":"2029-10-05T15:55:53Z"}]
  }}

Now these are only 16 entries, which seems a manageable amount. I have to find out whether there's a way to change the date of these emails (in the archive sources, not just in the index, so that the change is not overwritten again if we run a new full indexing).

Edit: upon noticing the relevant field is "archived_date" and not "date", it's only 8 entries - even better.
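
The failure mode can be reproduced with a minimal Python sketch. It is a simplified model of the behaviour described above, not HyperKitty's actual code: the function name `select_for_incremental_update`, the helper `utc`, the in-memory lists, and the 2024 dates are made up for illustration. Only the rule "start from the newest indexed archived_date" and the 2036 timestamp come from the observations above.

```python
from datetime import datetime, timezone


def select_for_incremental_update(indexed_dates, candidates):
    """Return candidates archived after the newest already-indexed entry.

    Simplified, hypothetical model of an incremental index update.
    """
    if not indexed_dates:
        return list(candidates)  # empty index: everything counts as new
    newest = max(indexed_dates)
    return [c for c in candidates if c["archived_date"] > newest]


def utc(*args):
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# One genuinely new email (illustrative date, not taken from the archive).
new_mail = [{"id": "new", "archived_date": utc(2024, 5, 2, 12, 0, 0)}]

# Healthy index: the newest indexed entry lies in the past.
healthy = [utc(2024, 5, 1, 9, 0, 0)]
# Poisoned index: additionally contains the 2036 entry from the query result.
poisoned = healthy + [utc(2036, 2, 7, 18, 17, 25)]

print(len(select_for_incremental_update(healthy, new_mail)))   # 1
print(len(select_for_incremental_update(poisoned, new_mail)))  # 0
```

Under this model, a single entry dated in the future is enough to make every later incremental run select nothing, which would fit the stale search results.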

Actions #5

Updated by crameleon 14 days ago

Somehow, I struggle to map these results to source files on disk.
I assumed all the raw archive data was in /var/lib/mailman/archives/prototype/. However, searching all the files for the ID number, sender, subject, or even body, I can't find anything for these entries. The file names do not seem to be useful either.
Where is HyperKitty finding the emails to index?

Actions #6

Updated by crameleon 14 days ago

It seems the email sources are actually inside the SQL database.
I can find an email by its ID number using SELECT * FROM hyperkitty_email WHERE id = <id>;.

Actions #7

Updated by crameleon 14 days ago · Edited

To find the real years, I searched the archives for replies to these emails, which was not always easy. Sometimes the thread was intact:

mailman_frontend=> SELECT subject,date,archived_date FROM hyperkitty_email WHERE parent_id = 3337183;
-[ RECORD 1 ]-+------------------------------------
subject       | Re: Netscape Cache-Browser ????????
date          | 1998-06-18 20:01:17+00
archived_date | 1998-06-18 20:01:17+00
-[ RECORD 2 ]-+------------------------------------
subject       | Re: Netscape Cache-Browser ????????
date          | 1998-06-18 22:07:17+00
archived_date | 1998-06-18 22:07:17+00

... but sometimes it was cut apart and I had to search by subject (or by snippets from the body, if the subject was just something like "QT"), and then I assumed that the reply happened in the same year.

I updated the date and archived_date timestamps. Setting "archived_date" this way is probably not quite correct, since the archiving happened at a different time, but I think it's better to make it match "date" than to guess when the email was originally archived. After all, it just needs to be in the past.

mailman_frontend=> UPDATE hyperkitty_email SET date = '2000-10-05 15:55:53+00' WHERE id = 3438372;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '2000-10-05 15:55:53+00' WHERE id = 3438372;

mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '2000-07-11 07:38:58+00' WHERE id = 3980366;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 05:50:23+00' WHERE id = 3330989;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 05:50:23+00' WHERE id = 3330989;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 06:31:53+00' WHERE id = 3333992;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 06:31:53+00' WHERE id = 3333992;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 06:35:28+00' WHERE id = 3333995;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 06:35:28+00' WHERE id = 3333995;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 07:00:15+00' WHERE id = 3336060;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 07:00:15+00' WHERE id = 3336060;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 07:15:47+00' WHERE id = 3337183;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 07:15:47+00' WHERE id = 3337183;

mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 18:17:25+0' WHERE id = 3938940;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 18:17:25+0' WHERE id = 3938940;
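
The paired statements above could also be written as one statement per email inside a transaction, so that both columns change together or not at all. This is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in for the real PostgreSQL database and a table reduced to three columns. `fix_future_timestamp` is a hypothetical helper, not part of HyperKitty.

```python
import sqlite3


def fix_future_timestamp(conn, email_id, corrected):
    """Set date and archived_date of one email to the same corrected value."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE hyperkitty_email SET date = ?, archived_date = ? WHERE id = ?",
            (corrected, corrected, email_id),
        )
    return cur.rowcount  # number of rows changed; 1 is expected


# Stand-in table (assumption: reduced schema) with one affected row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hyperkitty_email "
    "(id INTEGER PRIMARY KEY, date TEXT, archived_date TEXT)"
)
conn.execute(
    "INSERT INTO hyperkitty_email VALUES "
    "(3438372, '2029-10-05 15:55:53+00', '2029-10-05 15:55:53+00')"
)

changed = fix_future_timestamp(conn, 3438372, "2000-10-05 15:55:53+00")
row = conn.execute(
    "SELECT date, archived_date FROM hyperkitty_email WHERE id = 3438372"
).fetchone()
print(changed, row)
```

Checking the returned row count guards against a mistyped ID, which would otherwise pass silently.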
Actions #8

Updated by crameleon 14 days ago

Rebuilding the Solr index now (I wanted to update only the changed records, but there was not enough disk space).

Actions #9

Updated by crameleon 13 days ago · Edited

Index build finished, default backend switched. No outage besides a brief mailman-web restart.

It works unspectacularly - searching in the UI feels relatively speedy, and incremental updates happen again (tested by leaving the hourly runjob .timer stopped, finding a recent email in the lists, confirming it does not show up in the search, starting the hourly runjob service manually, confirming it does now show up in the search).
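
The "does it show up in the search" check could also be scripted against Solr directly instead of the UI. The following minimal sketch only builds the query URL and does not send a request. The host, port and core name `mailman` are taken from the curl call earlier in this ticket, and the `hyperkitty.email.<id>` document ID format from its output. The function name `build_lookup_url` and the choice of returned fields are assumptions.

```python
from urllib.parse import urlencode

# Endpoint as used in the curl call earlier in this ticket.
SOLR_QUERY_URL = "http://localhost:8983/solr/mailman/query"


def build_lookup_url(email_id, base=SOLR_QUERY_URL):
    """Build a Solr query URL that looks up one HyperKitty email by ID."""
    params = {
        # Quote the ID so it is treated as one literal value.
        "q": f'id:"hyperkitty.email.{email_id}"',
        "fl": "id,date,archived_date,mailinglist,subject",
        "rows": 1,
        "omitHeader": "true",
    }
    return f"{base}?{urlencode(params)}"


url = build_lookup_url(3438372)
print(url)
```

Fetching the resulting URL (for example with curl) before and after the hourly job should show "numFound" going from 0 to 1 for a recently archived email, assuming the ID field is indexed as in the output shown earlier.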

Patch in Salt pending via https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1801.
