Similar UTF-8 strings for autocomplete field

Question

Background

Users can type in a name and the system should match the text, even if the either the user input or the database field contains accented (UTF-8) characters. This is using the pg_trgm module.

Problem

The code resembles the following:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)

Failed Solution #1

I tried to implement an unaccent function and rewrite the query as:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    unaccent( label ) % unaccent( 'fil' )
  ORDER BY
    similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC

This returns only filbert.

Failed Solution #2

As suggested:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;

CREATE OR REPLACE FUNCTION unaccent_text(text)
  RETURNS text AS
$BODY$
  SELECT unaccent($1); 
$BODY$
  LANGUAGE sql IMMUTABLE
  COST 1;

All other indexes on the table have been dropped. Then:

CREATE INDEX label_unaccent_idx 
ON the_table( lower( unaccent_text( label ) ) );

This returns only one result:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

Question

What is the best way to rewrite the query to ensure that both results are returned?

Thank you!

Are all your searches left anchored? If so, good old `LIKE` will serve you well in combination with an index using `text_pattern_ops` (and your unaccent function). I wrote [more about that here](http://dba.stackexchange.com/questions/10694/pattern-matching-with-like-similar-to-or-regular-expressions-in-postgresql/10696#10696). Also, are you aware of the [PostgreSQL unaccent module](http://www.postgresql.org/docs/current/interactive/unaccent.html)? — Erwin Brandstetter, Apr 19 '12 at 22:11
Why are you using unaccent on the literal without accents, instead of the column containing the accents? I hope you used the unaccent function in defining the index, too. If so, it won't think it can use the index without using the same function on the search conditions. — kgrittn, Apr 19 '12 at 22:20
The searches are not left anchored. I didn't know about the unaccent module. I'll look into unaccenting the index. I mistyped the `unaccent` call, the failed solution now reflects reality. — Dave Jarvis, Apr 19 '12 at 22:31
This question appears to be off-topic because it is better suited for http://dba.stackexchange.com/. — Dave Jarvis, Nov 14 '14 at 22:36

Erwin Brandstetter · Accepted Answer · 2022-04-16T03:05:27.227

You are not using the operator class provided by the pg_trgm module. Create an index like this:

CREATE INDEX label_Lower_unaccent_trgm_idx
ON test_trgm USING gist (lower(unaccent_text(label)) gist_trgm_ops);

Originally, I had a GIN index here, but a GiST is typically better suited for this kind of query because it can return values sorted by similarity. See:

Your query has to match the index expression to be able to use it.

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil'
ORDER  BY similarity(label, 'fil') DESC;  -- ok to use original string here

However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect you really want:

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) LIKE 'fil%'  -- !
ORDER  BY similarity(label, 'fil') DESC;  -- ok to use original string here

This finds all strings starting with the search string, and sorts the best matches according to the % operator first.

The expression can use a GIN or GiST index since PostgreSQL 9.1! The manual:

Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example

If you actually meant to use the % operator:

Try adapting the threshold for the similarity operator %:

SET pg_trgm.similarity_threshold = 0.1;  -- Postgres 9.6 or later
SELECT set_limit(0.1);  -- Postgres 9.5 or older

Or even lower? The default is 0.3. Just to see whether the threshold filters additional matches.

@DaveJarvis: After [recent test results](http://stackoverflow.com/a/11250001/939860), I amended my answer to use a GiST index instead of GIN. May be of interest to you. (Reposted with link.) — Erwin Brandstetter, Aug 05 '12 at 03:48

Dave Jarvis · Answer 2 · 2012-04-20T00:14:50.433

A solution for PostgreSQL 9.1:

-- Install the requisite extensions.
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;

-- Function fixes STABLE vs. IMMUTABLE problem of the unaccent function.
CREATE OR REPLACE FUNCTION unaccent_text(text)
  RETURNS text AS
$BODY$
  -- unaccent is STABLE, but indexes must use IMMUTABLE functions.
  SELECT unaccent($1); 
$BODY$
  LANGUAGE sql IMMUTABLE
  COST 1;

-- Create an unaccented index.
CREATE INDEX the_table_label_unaccent_idx
ON the_table USING gin (lower(unaccent_text(label)) gin_trgm_ops);

-- Define the matching threshold.
SELECT set_limit(0.175);

-- Test the query (matching against the index expression).
SELECT
  label
FROM
  the_table
WHERE
  lower(unaccent_text(label)) % 'fil'
ORDER BY
  similarity(label, 'fil') DESC

Returns "filbert", "fish fillet", and "filé powder".

Without calling SELECT set_limit(0.175);, you can use the double tilde (~~) operator:

-- Test the query (matching against the index expression).
SELECT
  label
FROM
  the_table
WHERE
  lower(unaccent_text(label)) ~~ 'fil'
ORDER BY
  similarity(label, 'fil') DESC

Also returns "filbert", "fish fillet", and "filé powder".

+1 I like that you supply your solution in a solid format for others. Also addresses the STABLE vs. IMMUTABLE problem. — Erwin Brandstetter, Apr 19 '12 at 23:30

Similar UTF-8 strings for autocomplete field

Background

Problem

Failed Solution #1

Failed Solution #2

Question

Related

2 Answers2

Linked