How to search efficiently on multiple columns at once in PostgreSQL

Searching for patterns across multiple columns in PostgreSQL can get a significant performance boost from trigram indexes and generated columns

Peter Moser
27th August 2025

Assume we want to build an application that requires searching for entities in big tables. These searches are mostly not plain equality checks: think of an input field that should return matching results after the user has typed just a few letters. We will utilize the built-in pattern-matching mechanism of PostgreSQL for this task, which is a widely used and straightforward approach.

However, as you will see in this article, it gets tricky when the search spans multiple columns or complex data structures, such as JSON documents.

The simple scenario

Let's start simple: Assume we have an application that stores information about thousands of people in a single table called person. We want to search for their names by simply entering a few letters.

create table person(
    id int primary key,
    name text not null
);

insert into person(id, name)
values
    (1, 'Alice Miller'),
    (2, 'Bob Johnson'),
    (3, 'Clara Smith'),
    (4, 'David Brown'),
    (5, 'Emma Johnson');

Now, let's see what we can find searching for joh by using the case-insensitive pattern-matching operator ilike:

select * from person
where name ilike '%joh%';  -- the percent sign (%) matches any sequence of zero or more characters
 id | name
----+--------------
  2 | Bob Johnson
  5 | Emma Johnson

This works perfectly on a small table, but if we perform the same search on a huge table, the response time is no longer good enough.

The large table scenario

Fortunately, there is a solution to this problem, namely an extension called pg_trgm [1]. This extension splits text into so-called trigrams, groups of three consecutive characters of the input string, and writes them into an index that like-style operators can then search. Pattern matching essentially becomes a count of overlapping trigrams. An additional advantage of this approach is that a search term may even contain spelling errors and still produce some overlaps.
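To get a feeling for what such an index stores, the extension's show_trgm helper (available once pg_trgm is installed, as shown in the next step) displays the trigrams extracted from a string:

select show_trgm('Johnson');
-- returns the set of three-character groups of the lower-cased input, padded with
-- blanks at the word boundaries, e.g. ' jo', 'joh', 'ohn', 'hns', 'nso', 'son', ...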

First, we must install the extension and create a GIN index using gin(name gin_trgm_ops). The GIN operator class gin_trgm_ops tells PostgreSQL to break name into trigrams and lets operators such as like, ilike, or % (the similarity operator) use that GIN index.

create extension pg_trgm;
create index person_name_idx on person using gin(name gin_trgm_ops);

In a non-representative test with 3 million people, executing the query we used above gave the following measurements:

 index | operator | query time
-------+----------+------------
 no    | like     | 70 ms
 no    | ilike    | 270 ms
 yes   | like     | 4 ms
 yes   | ilike    | 4 ms

As you can see, with the trigram pattern-matching index the query runs more than an order of magnitude faster than a plain like or ilike scan without an index.
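For reference, a test of this kind can be set up roughly as follows. The synthetic names are just an assumption to fill the table; the exact timings will of course differ on your hardware:

-- fill the table with synthetic rows (ids 1-5 already exist from the example above);
-- creating the index after the bulk insert would be faster, but the order does not
-- matter for the measurement itself
insert into person(id, name)
select g, 'Person ' || md5(g::text)
from generate_series(6, 3000000) as g;

analyze person;

-- verify that the trigram index is actually used
explain analyze
select * from person
where name ilike '%joh%';
-- with the index in place, the plan should show a bitmap index scan on
-- person_name_idx instead of a sequential scan over the whole table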

However, this was a simple example with just one column. What if we want to span our search across multiple columns? A single trigram index cannot answer a pattern search that spans several columns directly, but there is a solution to the problem: concatenating the columns into a single, indexable string.

The large table scenario with multiple columns

As discussed before, we need a scenario with multiple columns. So, we create a new person table, as follows:

create table person(
    id int primary key,
    first_name text not null,
    last_name text not null
);

insert into person(id, first_name, last_name)
values
    (1, 'Alice', 'Miller'),
    (2, 'Bob', 'Johnson'),
    (3, 'Clara', 'Smith'),
    (4, 'David', 'Brown'),
    (5, 'Emma', 'Johnson');

Now, we have two possibilities to create indexes for better performance. We could create one index per column and query over both, but a single search term such as bob joh might then find nothing, because that term matches neither first_name nor last_name on its own. Hence, we create a single index by concatenating both columns. To keep this article concise, we assume that the search input is sanitized so that terms are separated by single spaces (a minimal sketch of such a normalization follows after the query below).

create index person_name_idx on person
using gin((first_name || ' ' || last_name) gin_trgm_ops);

To build a query that can use the index, we must use the same expression in our where-clause:

select * from person
where first_name || ' ' || last_name ilike '%bob joh%';
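Since we assumed single-space-separated search terms, here is a minimal sketch of how raw user input could be normalized before it is placed between the % wildcards (the input value is just an example):

-- collapse runs of whitespace into a single space and trim the ends
select '%' || regexp_replace(trim('  Bob   Joh '), '\s+', ' ', 'g') || '%';
-- yields '%Bob Joh%', which can then be used as the ilike pattern above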

Optional columns

Until now, we had only mandatory (non-nullable) columns. Let's add an optional field to store email addresses.

create table person(
    id int primary key,
    first_name text not null,
    last_name text not null,
    email text
);

insert into person(id, first_name, last_name, email)
values
    (1, 'Alice', 'Miller', 'info@example.com'),
    (2, 'Bob', 'Johnson', 'bob@jj-example.com'),
    (3, 'Clara', 'Smith', null),
    (4, 'David', 'Brown', 'david@brown-example.com'),
    (5, 'Emma', 'Johnson', null);

This looks easy at first: simply concatenate the email field into the expression, and the result should give us all we need. Let's try it.

select * from person
where first_name || ' ' || last_name || ' ' || email ilike '%johnson%';
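Surprisingly, Emma Johnson is missing from the result even though her last name matches. A quick check shows why:

select 'Emma Johnson ' || null;   -- yields NULL
select null ilike '%johnson%';    -- yields NULL (not true), so the row is filtered out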

Concatenating a string with null yields null, and null never matches the pattern on the right-hand side, so people without an email are silently dropped. To solve this problem, we use coalesce, which returns its first non-null argument; that is, we default to '' if a person has no email. The correct query and corresponding index look as follows:

select * from person
where first_name || ' ' || last_name || ' ' || coalesce(email, '') ilike '%johnson%';

create index person_name_idx on person
using gin((first_name || ' ' || last_name || ' ' || coalesce(email, '')) gin_trgm_ops);

Searching complex data

Assume we have a complex JSON array containing objects that store phone numbers with their phone type, and we want to search over such data together with the other fields:

create table person(
    id int primary key,
    first_name text not null,
    last_name text not null,
    email text,
    phone_numbers jsonb not null
);

insert into person(id, first_name, last_name, email, phone_numbers)
values
    (1, 'Alice', 'Miller', 'info@example.com', '[{"number": "123 456/1234", "type": "mobile"}]'),
    (2, 'Bob', 'Johnson', 'bob@jj-example.com', '[]'),
    (3, 'Clara', 'Smith', null, '[]'),
    (4, 'David', 'Brown', 'david@brown-example.com', '[]'),
    (5, 'Emma', 'Johnson', null, '[{"number": "567 5666", "type": "home"}]');

To use pattern matching, we must flatten the JSON structure into a string. We will use a comma-separated representation for that. First, we use jsonb_array_elements to extract all elements of the JSON array, then we collect the phone numbers into a native array and turn it into a string by joining all elements with a comma (,).

select * from person
where first_name
      || ' ' || last_name
      || ' ' || coalesce(email, '')
      || ' ' || array_to_string(array(
                    select jsonb_array_elements(phone_numbers)->>'number'), ',', '')
      ilike '%123%';

The query works nicely and returns Alice Miller as expected. So let's create an index from that expression.

create index person_name_idx on person
using gin((first_name || ' ' || last_name || ' ' || coalesce(email, '') || ' ' ||
    array_to_string(array(select jsonb_array_elements(phone_numbers)->>'number'), ',', '')) gin_trgm_ops);

Unfortunately, this approach does not work, due to two problems with the expression above: first, an index expression cannot contain a subquery, and second, all functions used in an index expression must be immutable. To overcome this, we create an immutable function [2]:

create or replace function search_field_string (
    phone_numbers jsonb,
    first_name text,
    last_name text,
    email text
)
returns text
as $$
declare
    SEPARATOR constant text = '+++';  -- Separator `+++` between concatenated elements in our search_field.
                                      -- We use a longer string here so that we do not wrongly match a search
                                      -- like 'JOHN INFO' when John happens to have an email like `info@example.com`
begin
    -- join all elements of the constructed array with the separator `+++`
    return array_to_string(
        array[
            concat_ws(' ', first_name, last_name),  -- mandatory name columns, which match searches like `JOHN DOE`
            concat_ws(' ', last_name, first_name),  -- mandatory name columns, which match searches like `DOE JOHN`
            coalesce(email, ''),       -- an optional email column, which defaults to an empty string if NULL

            -- a JSON array of objects that contain phone numbers;
            -- we keep only digits and concatenate them with the separator `+++`
            array_to_string(
                array(select regexp_replace(jsonb_array_elements(phone_numbers)->>'number', '[^0-9]', '', 'g')),
                SEPARATOR, ''
            )
        ],
        SEPARATOR, ''
    );
end
$$
language plpgsql
immutable;
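A quick sanity check of the function against the data of the first row shows the string that will later end up in the search column (compare it with the search_field values in the table further below):

select search_field_string(
    '[{"number": "123 456/1234", "type": "mobile"}]'::jsonb,
    'Alice', 'Miller', 'info@example.com'
);
-- Alice Miller+++Miller Alice+++info@example.com+++1234561234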

Finally, we can materialize the result of the function in a generated column and build our new trigram search index on top of it. The generated column is an optional step: we could also use the function expression directly when creating the index (see the sketch after the next code block). However, having the column makes it easy to see what our search strings look like and simplifies the where-clause in search queries.

-- Add a generated (materialized) column that stores the result of our function
alter table person
    add column search_field text generated always as (
        text (search_field_string(phone_numbers, first_name, last_name, email))
    ) stored;

-- Create a trigram pattern matching index on top of the generated column `search_field`
create index person_search_field_idx on person using gin(search_field gin_trgm_ops);
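For completeness, the alternative mentioned above, indexing the function result directly without a generated column, would look roughly like this; queries then have to repeat the exact same expression in the where-clause for the index to be used:

-- sketch of the alternative: an expression index on the function call itself
-- (use either this or the generated-column index above, not both)
create index person_search_field_idx on person
using gin (search_field_string(phone_numbers, first_name, last_name, email) gin_trgm_ops);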

Let's have a glance at the person table:

 id | first_name | last_name | email                   | phone_numbers                                  | search_field
----+------------+-----------+-------------------------+------------------------------------------------+--------------------------------------------------------------
  1 | Alice      | Miller    | info@example.com        | [{"type": "mobile", "number": "123 456/1234"}] | Alice Miller+++Miller Alice+++info@example.com+++1234561234
  2 | Bob        | Johnson   | bob@jj-example.com      | []                                             | Bob Johnson+++Johnson Bob+++bob@jj-example.com+++
  3 | Clara      | Smith     |                         | []                                             | Clara Smith+++Smith Clara++++++
  4 | David      | Brown     | david@brown-example.com | []                                             | David Brown+++Brown David+++david@brown-example.com+++
  5 | Emma       | Johnson   |                         | [{"type": "home", "number": "567 5666"}]       | Emma Johnson+++Johnson Emma++++++5675666

Searching has now become much easier. Due to the new search_field column, the where-clause expression no longer needs string concatenation:

select * from person
where search_field ilike '%example%';

The search_field comes with some drawbacks: obviously, the column requires storage, and during insert and update operations, the function call will require extra execution time.

Spelling mistakes

Finally, as mentioned earlier, a trigram index supports searching by similarity, allowing us to obtain results even when the input contains spelling errors. Assume the user wants to find people with the surname Johnson, but enters jon as a search term. The like operators would not return the expected results, but another operator, %>, searches by word similarity and therefore returns rows whose similarity to the search term exceeds a threshold.

select * from person
where search_field %> 'jon'
order by word_similarity(search_field, 'jon') desc;

We also sort by the similarity score to show the most relevant results on top [3].
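How many rows %> returns depends on a configurable threshold, which can be adjusted per session if the default is too strict or too lenient:

-- pg_trgm.word_similarity_threshold defaults to 0.6; lowering it makes the
-- %> operator more forgiving (more, but less precise, matches)
set pg_trgm.word_similarity_threshold = 0.4;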

Similarity search and scoring is a huge, interesting topic, worth a separate blog post. Stay tuned!

Conclusion

In this blog post, we have seen that we can search for patterns in text fields using the pattern-matching operators like and ilike, or the word-similarity operator %>, together with a trigram index to significantly improve performance. A single trigram index cannot span multiple columns directly, but we can work around this by concatenating fields of various complexity, optionality, and type inside an immutable function. The materialized result is much easier to use in search queries and during index creation. Used wisely, the performance gain should outweigh the storage penalty of the newly created generated search column.

Footnotes

  1. pg_trgm: support for similarity of text using trigram matching (PostgreSQL documentation)

  2. Function volatility categories (PostgreSQL documentation) explains in more depth what an immutable function is

  3. The word_similarity function puts too high a penalty on the length of our search_field, so that shorter fields get a higher score. We would need to investigate this and find a better similarity scoring function.
