tickets #159873
closed
Rework Mailman archive search
Description
The current Xapian-based search index is badly out of date, and updating an index of more than 3532690 emails seems to take an eternity.
Alternative backends are ElasticSearch and Solr. We have an ElasticSearch cluster, but it is currently specific to the Wiki. To avoid building a second one, and because Mailman is a single-node service anyway, Solr seems to be a good fit. It can run locally on mailman3.i.o.o for now.
Updated by crameleon 9 months ago
- Status changed from New to Workable
- Assignee set to crameleon
- Private changed from Yes to No
Started a package in https://build.opensuse.org/package/show/home:crameleon:misc/solr-binary. It only repackages the binary, since the Gradle plugins are nearly impossible to package offline. This is not ideal, but it works very well, and the upstream release is signed.
I will link it to o:i:mailman3 and announce when I plan the migration.
Updated by crameleon 9 months ago
- Due date set to 2023-05-25
- Start date changed from 2024-05-02 to 2023-05-25
- Follows tickets #129805: search results in https://lists.opensuse.org/archives/ do not show results of last two years added
Updated by crameleon 9 months ago · Edited
Upon indexing and inspecting the result (with Solr, it's actually fun to query the data!), I found out why the incremental indexing might never have worked. There are emails with a date in the future. On an incremental update, HyperKitty retrieves the last entry and starts the indexing from that date. Naturally, this will not match anything "new" given the following:
$ curl 'http://localhost:8983/solr/mailman/query?q=*:*&q.op=OR&indent=true&rows=50&sort=archived_date%20desc&fl=id,date,archived_date,mailinglist,subject&fq=archived_date:%5BNOW%20TO%20*%5D&useParams=&omitHeader=true'
{
"response":{"numFound":8,"start":0,"numFoundExact":true,"docs":[
{
"id":"hyperkitty.email.3938940",
"mailinglist":"users@lists.opensuse.org",
"subject":"\nRe: [SuSE Linux] eth0 (network unreachable)\n",
"date":"2036-02-07T18:17:25Z",
"archived_date":"2036-02-07T18:17:25Z"},
{
"id":"hyperkitty.email.3337183",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nNetscape Cache-Browser ????????\n",
"date":"2036-02-07T07:15:47Z",
"archived_date":"2036-02-07T07:15:47Z"},
{
"id":"hyperkitty.email.3336060",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nRe: Netscape-Parameter\n",
"date":"2036-02-07T07:00:15Z",
"archived_date":"2036-02-07T07:00:15Z"},
{
"id":"hyperkitty.email.3333995",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nQT\n",
"date":"2036-02-07T06:35:28Z",
"archived_date":"2036-02-07T06:35:28Z"},
{
"id":"hyperkitty.email.3333992",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nNET DATE 2\n",
"date":"2036-02-07T06:31:53Z",
"archived_date":"2036-02-07T06:31:53Z"},
{
"id":"hyperkitty.email.3330989",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nRe: NET TIME unter LINUX\n",
"date":"2036-02-07T05:50:23Z",
"archived_date":"2036-02-07T05:50:23Z"},
{
"id":"hyperkitty.email.3980366",
"mailinglist":"users@lists.opensuse.org",
"subject":"\n[SLE] Every 2nd time\n",
"date":"2000-07-11T07:38:58Z",
"archived_date":"2032-06-10T12:24:13Z"},
{
"id":"hyperkitty.email.3438372",
"mailinglist":"users-de@lists.opensuse.org",
"subject":"\nScandisk f. Linux\n",
"date":"2029-10-05T15:55:53Z",
"archived_date":"2029-10-05T15:55:53Z"}]
}}
Now these are only 16 entries, which seems a manageable amount. I have to find out whether there's a way to change the date of these emails (in the archive sources, not just in the index, so that it is not overwritten again if we conduct a new full indexing).
Edit: upon noticing the relevant field is "archived_date" and not "date", it's only 8 entries - even better.
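The failure mode described above can be illustrated with a small model. This is only a sketch, not HyperKitty's actual code: the function `select_for_incremental_update` and the sample records are hypothetical, and the only assumption taken from this ticket is that an incremental run starts from the newest "archived_date" already in the index.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Shorthand for a timezone-aware UTC date at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def select_for_incremental_update(indexed_dates, emails):
    """Simplified model: pick emails archived after the newest indexed entry."""
    if not indexed_dates:
        return list(emails)  # empty index: everything needs indexing
    start = max(indexed_dates)
    return [e for e in emails if e["archived_date"] > start]


# A genuinely new email (hypothetical sample data).
new_email = {"id": 3, "archived_date": utc(2024, 5, 2)}

# Index containing one future-dated entry: the start date lands in 2036.
stuck = select_for_incremental_update(
    [utc(1998, 6, 18), utc(2036, 2, 7)], [new_email]
)

# Same index after the future date has been corrected to the past.
fixed = select_for_incremental_update(
    [utc(1998, 6, 18), utc(1998, 2, 7)], [new_email]
)

print(len(stuck), len(fixed))
```

In this model, a single future-dated entry is enough to make every later incremental run select nothing until that date has passed.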
Updated by crameleon 9 months ago
Somehow, I struggle to map these results to source files on disk.
I assumed all the raw archive data was in /var/lib/mailman/archives/prototype/. However, searching all the files for the ID number, sender, subject, or even body, I don't find anything for these entries. The file names do not seem to be useful either.
Where is HyperKitty finding the emails to index?
Updated by crameleon 9 months ago · Edited
To find the real years, I searched the archives for replies to these emails, which turned out to be not always so easy. Sometimes the thread was intact:
mailman_frontend=> SELECT subject,date,archived_date FROM hyperkitty_email WHERE parent_id = 3337183;
-[ RECORD 1 ]-+------------------------------------
subject | Re: Netscape Cache-Browser ????????
date | 1998-06-18 20:01:17+00
archived_date | 1998-06-18 20:01:17+00
-[ RECORD 2 ]-+------------------------------------
subject | Re: Netscape Cache-Browser ????????
date | 1998-06-18 22:07:17+00
archived_date | 1998-06-18 22:07:17+00
.. but sometimes it was cut apart and I had to search by subject - or by snippets from the body, if the subject was just something like "QT" - and then I assumed that the reply happened in the same year.
I updated the date and archived_date timestamps - of course, setting "archived_date" is probably not quite correct, as the archiving happened at a different date, but I think it's better to just make it match the "date" instead of trying to guess when it might have originally been archived - after all, it just needs to be in the past.
mailman_frontend=> UPDATE hyperkitty_email SET date = '2000-10-05 15:55:53+00' WHERE id = 3438372;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '2000-10-05 15:55:53+00' WHERE id = 3438372;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '2000-07-11 07:38:58+00' WHERE id = 3980366;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 05:50:23+00' WHERE id = 3330989;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 05:50:23+00' WHERE id = 3330989;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 06:31:53+00' WHERE id = 3333992;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 06:31:53+00' WHERE id = 3333992;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 06:35:28+00' WHERE id = 3333995;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 06:35:28+00' WHERE id = 3333995;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 07:00:15+00' WHERE id = 3336060;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 07:00:15+00' WHERE id = 3336060;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 07:15:47+00' WHERE id = 3337183;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 07:15:47+00' WHERE id = 3337183;
mailman_frontend=> UPDATE hyperkitty_email SET date = '1998-02-07 18:17:25+0' WHERE id = 3938940;
mailman_frontend=> UPDATE hyperkitty_email SET archived_date = '1998-02-07 18:17:25+0' WHERE id = 3938940;
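The corrected values above keep the month, day, and time of the bogus timestamp and replace only the year with the one inferred from the replies. A minimal sketch of that derivation follows; the helper name `move_to_year` is hypothetical and this is not how the values were actually produced.

```python
from datetime import datetime, timezone


def move_to_year(timestamp, year):
    """Keep month, day and time of day; replace only the year.

    Note: datetime.replace raises ValueError for 29 February if the
    target year is not a leap year.
    """
    return timestamp.replace(year=year)


# Future-dated value of email 3337183 as shown in the Solr output.
bogus = datetime(2036, 2, 7, 7, 15, 47, tzinfo=timezone.utc)

# The replies to this email are dated 1998.
corrected = move_to_year(bogus, 1998)

print(corrected.isoformat(sep=" "))
```

Both columns could also be set in a single statement of the form UPDATE hyperkitty_email SET date = ..., archived_date = ... WHERE id = ..., which would halve the number of statements.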
Updated by crameleon 9 months ago · Edited
Index build finished, default backend switched. No outage besides a brief mailman-web restart.
It works unspectacularly - searching in the UI feels relatively speedy, and incremental updates happen again (tested by leaving the hourly runjob .timer stopped, finding a recent email in the lists, confirming it does not show up in the search, starting the hourly runjob service manually, confirming it does now show up in the search).
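The same check could in principle be made against Solr directly instead of the UI. The sketch below only builds a query URL for documents archived within the last hour; it reuses the core URL and field names from the curl call earlier in this ticket, while the function name `recent_documents_url` and the one-hour window are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Core URL as used in the curl call earlier in this ticket.
SOLR_QUERY_URL = "http://localhost:8983/solr/mailman/query"


def recent_documents_url(window="NOW-1HOUR", rows=10):
    """Build a Solr query URL for documents archived since `window`."""
    params = {
        "q": "*:*",
        # Solr date math: range from one hour ago until now.
        "fq": f"archived_date:[{window} TO NOW]",
        "sort": "archived_date desc",
        "fl": "id,archived_date,mailinglist,subject",
        "rows": rows,
        "omitHeader": "true",
    }
    return f"{SOLR_QUERY_URL}?{urlencode(params)}"


url = recent_documents_url()
print(url)

# Decode the URL again to show the filter that would be sent.
print(parse_qs(urlsplit(url).query)["fq"][0])
```

Fetching the resulting URL with curl before and after a manual run of the hourly job should show whether new documents arrived in the index.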
Patch in Salt pending via https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/1801.
Updated by crameleon 7 months ago
- Related to tickets #108641: mailing list search lagging? added
Updated by crameleon 7 months ago
- Related to tickets #107167: archiving stuck again added
Updated by crameleon 7 months ago
- Related to tickets #88903: Fwd: support@lists.opensuse.org - search index not up-to-date added
Updated by crameleon 7 months ago
- Related to tickets #100802: List archive search broken? added