Ordering full text search in Pleroma with GIN

Introduction

Lain gave a great rundown of the current problems with Full Text Search (FTS) in Pleroma[1]. In short the biggest problem is that while text search actually works quite well, the problem lies in the ordering of posts. The bigger the resultset is, the longer it takes to order it and it can take up way too long until a timeout is reached. The solution is to use a specific kind of indexes, called RUM indexes. These RUM indexes, however, are not part of the default Posgresql installation and require manual installation. To work around this Pleroma allows two options for FTS.

  1. You can install RUM indexes and activate its use in the config file. This will give you fast and ordered results, but is dificult to install.
  2. Use the default, which is GIN indexes. This doesn't require the admin to do extra work and the results are fast, but unordered.

But what if we did order GIN?

From the fact that the problem has to do with how big the resultset is, we can also conclude that this will be less of a problem for smaller databases. I have a single user instance running for about a year and a half and I have already pruned remote posts older than 6 months. All in all my database is 3012MB.

SELECT pg_size_pretty(pg_database_size('pleroma_dev'));
-- 3012 MB

Installing RUM indexes didn't seem to be worth the trouble to me, but I really wondered what ordered results with a GIN index would mean for my instance. Based on Lain's article I decided to run an ordered search directly on the database. The results were found in a couple of hundred milliseconds. Technically this can still be considered a lot for a query, but for my needs is good enough so I decided to see if I could make the search on my instance work ordered as well, even with GIN indexes.

The first thing I did was checking what query is used now. I set up a Pleroma instance for testing, logged all queries on the database and found

2020-02-17 10:21:32.188 CET [10087] pleroma_ynh@pleroma_ynh LOG:  execute ecto_1285: SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') WHERE (NOT (a0."actor" = ANY($1))) AND (a0."data"->>'type' = 'Create') AND ($2 = ANY(a0."recipients")) AND (to_tsvector('english', o1."data"->>'content') @@ plainto_tsquery('english', $3)) LIMIT $4 OFFSET $5
2020-02-17 10:21:32.188 CET [10087] pleroma_ynh@pleroma_ynh DETAIL:  parameters: $1 = '{}', $2 = 'https://www.w3.org/ns/activitystreams#Public', $3 = 'tester', $4 = '40', $5 = '0'

Something that pop's out is the use of to_tsvector, so I grepped the Pleroma code for files containing this and found lib/pleroma/activity/search.ex. Here I found two separate functions for GIN and RUM indexes. Looking at both, trying things out and using the internet to search more about how the syntax is for ordering results in Ecto (the library that is used for database connections in Elixir), I was able to change

  defp query_with(q, :gin, search_query) do
    from([a, o] in q,
      where:
        fragment(
          "to_tsvector('english', ?->>'content') @@ plainto_tsquery('english', ?)",
          o.data,
          ^search_query
        )
    )
  end

to

  defp query_with(q, :gin, search_query) do
    from([a, o] in q,
      where:
        fragment(
          "to_tsvector('english', ?->>'content') @@ plainto_tsquery('english', ?)",
          o.data,
          ^search_query
        ),
      order_by: {:desc, o.id}
    )
  end

I then restarted my instance and that's it! I noticed it can still easily take several seconds before I get results, but the results are nicely ordered and that's a huge improvement to how it was before IMO.

Logging queries

One of the things I did was logging the queries. For this I took my main instance down so that there wouldn't be other queries made besides the one from the testing instance. Then I did

# Note that my Postgresql version is 9.6
nano /etc/postgresql/9.6/main/postgresql.conf
# Change log_statement to 'all' (to deactivate you can simply comment the line)
systemctl reload postgresql
tail -f /var/log/postgresql/postgresql-9.6-main.log

and then I did a search from the web interface with a word I could easily find back in the logs (in this case 'tester').

But how future proof is this?

It should be clear that this will only work with small databases. The bigger the database gets, the longer these searches will take and after a while, they may not work at all any more because of timeouts. Maybe RUM indexes will become easier to install in the future, or maybe even become part of Postgres itself. When that happens, switching to RUM indexes will undoubtedly be the smartest move.

It should also be clear that we've changed the source code which means that upgrading the instance will take a bit more care to avoid breaking things.

Updates

2020-04-12 Made better explanation between the advantages and disadvantages of RUM and GIN indexes

Sources

[1] https://blog.soykaf.com/post/postgresl-front-report/