Search is not magic with PostgreSQL
Well, I lied. Doing a good search with the right results for your data is indeed magic. But thanks to the people who work on PostgreSQL and extensions it’s not really magic. It can be a lot easier than you think.

Here I want to show you several approaches to making search better for your data. I will show you powerful tools and some tips I have learned from days and weeks of working on this. So that it isn't pure SQL, I will also show implementations of these approaches in the Elixir language with the great Ecto library. It's important to say that this "problem" cannot be solved in the database alone. It has to be solved in both places where you have logic and data.
A long time ago, when someone said “search” to me, the first thing I thought of was:
select * from products where name ilike '%banana%';
It’s easy with Ecto:
text = "banana"
from(p in Products, where: ilike(p.name, "%#{text}%"))
|> Repo.all()
Is it good enough? Is it safe? Well, it worked. We can still say that it works in some cases. But people are lazier these days and don't pay attention to what they write. So if a customer writes bannana, it will not work, because all your products named banana simply won't match. You can say:
Yeah, I don’t care, it’s just a search…
Well, if you think it doesn't matter, you can Google it or ask someone. For example, some random statistics from the internet:
43% of retail customers head directly to the search bar on a website.
On-site search, used by 78% of shoppers, is used more frequently than the navigation menu (49%), the filter feature (37%), and homepage recommendations (30%).
As many as 68% of customers would not return to a site with a poor search experience.
Source: https://www.mailmodo.com/guides/ecommerce-site-search-statistics/
And I know a lot of sites where you can see that more than 80% of customers are using search instead of categories.
Let's do a very simple calculation with fake data: you have 1,000 visitors (per day/month, it doesn't matter). Based on the statistics above, about 790 visitors will use search. Let's say 30% of them get wrong results in search. That's 237 visitors who get wrong results. What does it mean? Say you have a 20% conversion rate for turning regular visitors into paying customers (who have bought something). That's 47.4 customers you lost.
Let's say your average shopping cart is 80 (EUR, dollars, ...). Basically this means that you have lost 3,792 per day/month/... Take your own statistics and work this out for yourself, or ask your boss to do it. Have I got your attention?
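The arithmetic above can be sketched in a few lines of Elixir (all the rates are the made-up numbers from the example):

```elixir
# Rough lost-revenue estimate using the made-up numbers above.
visitors = 1000
search_rate = 0.79       # share of visitors who use search
bad_results_rate = 0.30  # share of searches returning wrong results
conversion_rate = 0.20   # visitor -> paying customer
avg_cart = 80

lost_customers = visitors * search_rate * bad_results_rate * conversion_rate
lost_revenue = lost_customers * avg_cart
# lost_customers is roughly 47.4, lost_revenue roughly 3792 per period
```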
Are there other options?
You probably already know the answer, so take a look at what we have on our plate for today with Elixir/Ecto and PostgreSQL:
- Similarity
- TS Vectors
- Semantic Search
- Hybrid Search
- 3rd party engines/providers
Similarity
Similarity is functionality PostgreSQL gains from the pg_trgm extension. The definition from the docs:
The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.
Ok, but what about trigrams? What are they? It's important to understand what you're doing, not just use a function without knowing what it does, especially when it's not that complicated to understand.
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.
That's much better, but let's look at some examples to understand it properly.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
select show_trgm('banana');
=> { b, ba,ana,ban,na ,nan}
select show_trgm('effective for measuring the similarity');
=> { e, f, m, s, t, ef, fo, me, si, th,ari,asu,cti,eas,ect,eff,fec,ffe,for,he ,ila,imi,ing,ity,ive,lar,mea,mil,ng ,or ,rin,rit,sim,sur,the,tiv,ty ,uri,ve }
A super simple explanation: it splits text into many small chunks and compares them with each other, counting how many of those chunks are the same. The show_trgm function above only shows how text is split into the strings that get compared and ranked; the actual search uses another function:
select name, similarity(name, 'banana')
from products
order by 2 desc;
=>
+-------------+----------+
|name |similarity|
+-------------+----------+
|Banana Bread |0.54545456|
|Banan (lime) |0.5 |
|Banana Milk. |0.46153846|
+-------------+----------+
As you can see, it's quite simple. Thanks to the ranking you can sort by it, and you should also filter by it, because the query returns a ranking for every row in the table, so you will also get products with a ranking of 0. Which sensitivity you want to achieve depends on your dataset. Ideally you should limit it like this:
select name, similarity(name, 'banana')
from products
where similarity(name, 'banana') > 0.241
order by 2 desc;
This lets you see how many results you get and which results end up at the bottom. If the ones at the bottom still make sense to show, try lowering the threshold and check what comes in below it. This will help you find the ideal number for your data. Don't be afraid to use a longer number with more decimal places.
What about Elixir and Ecto? There is nothing special to do once the extension is enabled in your database.
text = "banana"

from(p in Products,
  where: fragment("similarity(?, ?)", p.name, ^text) > 0.241,
  order_by: [desc: fragment("similarity(?, ?)", p.name, ^text)]
)
|> Repo.all()
Don't be afraid to use similarity on multiple attributes together and combine them to get the best results. Thanks to indexes you can get really fast responses. If you have several attributes, or even tables merged with UNION or something like that, you can add constants as weights to lower some results, because maybe it's more important to match the product name than ingredients, brands, shops, etc.
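As a sketch of the idea, a weighted combination might look like this (the 0.6/0.3/0.1 constants and the ingredients and brand columns are made up for illustration). The pg_trgm GIN index at the end is what makes trigram lookups with the % operator and ILIKE fast:

```sql
-- Hypothetical weighted ranking across several columns:
-- a match in the name counts more than one in ingredients or brand.
SELECT name,
       0.6 * similarity(name, 'banana') +
       0.3 * similarity(ingredients, 'banana') +
       0.1 * similarity(brand, 'banana') AS rank
FROM products
WHERE similarity(name, 'banana') > 0.241
   OR similarity(ingredients, 'banana') > 0.241
ORDER BY rank DESC;

-- Trigram index; speeds up the % similarity operator and ILIKE lookups.
CREATE INDEX idx_products_name_trgm ON products USING gin (name gin_trgm_ops);
```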
Are we done? Is this the best we can do for search? Again, it depends on your dataset and your usage. It may be sufficient, but it may also produce a lot of false positives, simply because some words share the same trigrams, the same "small parts". It can be enough thanks to its speed and generous results. But human languages are complex: there are different words with the same meaning, words change letters (1 banana vs. 2 bananas), and so on. This can play an important role in search.
TS Vectors
Unlike similarity, full-text search with ts vectors is built into PostgreSQL, so no extension is needed.
To move forward and speed up reading a little, we can simplify it: search with ts vectors parses text into tokens based on a language configuration. Easier: text is parsed into small words based on the language, so for example cakes is parsed to cake with the English configuration. Have a look:
select to_tsvector('english', 'banana cakes for birthday parties');
=> 'banana':1 'birthday':4 'cake':2 'parti':5
See how sensitive it is when you change the language:
select to_tsvector('norwegian', 'banana cakes for birthday parties');
=> 'banan':1 'birthday':4 'cak':2 'parti':5
cakes is not cake but cak, because the system used a different dictionary to define the tokens and created a list of lexemes which are similar but different. It's a simple example, but imagine the complications in your own language that change the way a word looks. You know that better than me. Look at some examples in other languages:
SELECT to_tsvector('french', 'économie française est en croissance rapide.');
=> 'croissanc':5 'français':2 'rapid':6 'économ':1
SELECT to_tsvector('german', 'Die wirtschaftliche Lage in Deutschland ist stabil und wachsend.');
=> 'deutschland':5 'lag':3 'stabil':7 'wachsend':9 'wirtschaft':2
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog.');
=> 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
It's important to mention how the results are cleaned of unimportant words. See the last example in English: every the is ignored. The same happens in the other languages. These are called stop words and are basically discarded to make the search cleaner.
OK, the first part is done. Now the query itself, which is a little more complicated. A simple example:
select to_tsquery('english','cooking');
=> 'cook'
Nice, but not usable on its own. Let's see it in action:
select name
from products
where to_tsvector('english', name) @@ to_tsquery('english', 'cooking');
=>
+-------------------------------+
|name |
+-------------------------------+
|Beef - Cooked, Corned |
|Ham - Cooked Italian |
|Ham - Cooked |
|Oil - Cooking Spray |
|Pork - Bacon Cooked Slcd |
|Lobster - Cooked |
|Shrimp - 16 - 20 Cooked, Peeled|
+-------------------------------+
Well, it works, but it would be nice to add sorting so that we're sure the best results are at the top.
select name, ts_rank_cd(to_tsvector('english', name), to_tsquery('english', 'cooking'))
from products
where to_tsvector('english', name) @@ to_tsquery('english', 'cooking')
order by ts_rank_cd(to_tsvector('english', name), to_tsquery('english', 'cooking')) desc;
=>
+-------------------------------+----------+
|name |ts_rank_cd|
+-------------------------------+----------+
|Beef - Cooked, Corned |0.1 |
|Ham - Cooked Italian |0.1 |
|Ham - Cooked |0.1 |
|Oil - Cooking Spray |0.1 |
|Pork - Bacon Cooked Slcd |0.1 |
|Lobster - Cooked |0.1 |
|Shrimp - 16 - 20 Cooked, Peeled|0.1 |
+-------------------------------+----------+
This function (ts_rank_cd) computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.
I've added the ts_rank_cd column just to show you what's happening. This function uses weights to rank the results. The defaults are {0.1, 0.2, 0.4, 1.0}.
Weights are used for labelled words/vectors. Labelled words? Vectors? What? Let's explain how it works by default. Weights are defined per label; the labels are {D, C, B, A}, so the weight for label D is 0.1, etc. If no label is assigned to a word/column, label D is used. To make it even simpler: our search term cooking is (after being converted to a ts query) present only once in the result column, so the rank is 1 * 0.1. So in this case it would be wise to add another sort key based on your business logic, for example how many times the product has been bought.
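Such a secondary sort is just another ORDER BY key. A sketch, assuming a hypothetical times_purchased column:

```sql
-- Rank by text relevance first, then break ties with a business metric
-- (times_purchased is an assumed column).
SELECT name
FROM products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'cooking')
ORDER BY ts_rank_cd(to_tsvector('english', name),
                    to_tsquery('english', 'cooking')) DESC,
         times_purchased DESC;
```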
However, searching by product name alone is rarely enough; it's most often combined with searching the description.
SELECT name,
ts_rank_cd(
to_tsvector('english', name) ||
to_tsvector('english', description),
to_tsquery('english', 'cooking')
) AS rank
FROM products
WHERE (to_tsvector('english', name) ||
to_tsvector('english', description)) @@ to_tsquery('english', 'cooking')
ORDER BY rank DESC;
This takes the vectors for name and description and concatenates them. Or you can use a regular OR condition:
SELECT name,
ts_rank_cd(
to_tsvector('english', name) ||
to_tsvector('english', description),
to_tsquery('english', 'cooking')
) AS rank
FROM products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'cooking')
OR to_tsvector('english', description) @@ to_tsquery('english', 'cooking')
ORDER BY rank DESC;
=>
+---------------------------------+----+
|name |rank|
+---------------------------------+----+
|Handcrafted Italian Pasta |0.7 |
|Pasta - Penne Primavera, Single |0.1 |
|Pasta - Cappellini, Dry |0.1 |
|Pasta - Tortellini, Fresh |0.1 |
|Pasta - Fusili, Dry |0.1 |
|Pasta - Penne, Rigate, Dry |0.1 |
|Pasta - Fusili Tri - Coloured |0.1 |
|Pasta - Rotini, Colour, Dry |0.1 |
|Pasta - Shells, Medium, Dry |0.1 |
|Pasta - Cannelloni, Sheets, Fresh|0.1 |
|Pasta - Orzo, Dry |0.1 |
|Pasta - Lasagna, Dry |0.1 |
|Pasta - Fettuccine, Egg, Fresh |0.1 |
|Pasta - Lasagna Noodle, Frozen |0.1 |
|Pasta - Linguini, Dry |0.1 |
+---------------------------------+----+
It's important to note that if you use two columns to search, you can end up with two ranks, and there are many ways to combine them. By default you can concatenate both vectors and rank based on that, like in the previous example, so you get one rank. These ranks are just numbers, so you can do whatever your maths skills allow.
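For example, instead of concatenating the vectors you could compute a separate rank per column and combine them with your own weights (the 2.0 and 1.0 multipliers here are arbitrary; pick your own):

```sql
-- Two independent ranks combined as a weighted sum.
SELECT name,
       2.0 * ts_rank_cd(to_tsvector('english', name),
                        to_tsquery('english', 'cooking')) +
       1.0 * ts_rank_cd(to_tsvector('english', description),
                        to_tsquery('english', 'cooking')) AS rank
FROM products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'cooking')
   OR to_tsvector('english', description) @@ to_tsquery('english', 'cooking')
ORDER BY rank DESC;
```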
So now we search in two columns, but it's more important to show products that match the name than ones that match only the description. This is where weights come in.
SELECT name,
ts_rank_cd(
setweight(to_tsvector('english', name), 'D') ||
setweight(to_tsvector('english', description), 'D'),
to_tsquery('english', 'pasta')
) AS rank
FROM products
WHERE setweight(to_tsvector('english', name), 'D') @@ to_tsquery('english', 'pasta')
OR setweight(to_tsvector('english', description), 'D') @@ to_tsquery('english', 'pasta')
ORDER BY rank DESC
LIMIT 20;
=>
+---------------------------------+----+
|name |rank|
+---------------------------------+----+
|Handcrafted Italian Pasta |0.7 |
|Pasta - Penne Primavera, Single |0.1 |
|Pasta - Cappellini, Dry |0.1 |
|Pasta - Tortellini, Fresh |0.1 |
|Pasta - Fusili, Dry |0.1 |
|Pasta - Penne, Rigate, Dry |0.1 |
|Pasta - Fusili Tri - Coloured |0.1 |
|Pasta - Rotini, Colour, Dry |0.1 |
|Pasta - Shells, Medium, Dry |0.1 |
|Pasta - Cannelloni, Sheets, Fresh|0.1 |
|Pasta - Orzo, Dry |0.1 |
|Pasta - Lasagna, Dry |0.1 |
|Pasta - Fettuccine, Egg, Fresh |0.1 |
|Pasta - Lasagna Noodle, Frozen |0.1 |
|Pasta - Linguini, Dry |0.1 |
+---------------------------------+----+
You can see the exact same results. But we want the match with the product name to be at the top and the description to be a little lower.
SELECT name,
ts_rank_cd(
'{0.1, 0.2, 0.9, 1.0}',
setweight(to_tsvector('english', name), 'A') ||
setweight(to_tsvector('english', description), 'B'),
to_tsquery('english', 'pasta')
) AS rank
FROM products
WHERE setweight(to_tsvector('english', name), 'A') @@ to_tsquery('english', 'pasta')
OR setweight(to_tsvector('english', description), 'B') @@ to_tsquery('english', 'pasta')
ORDER BY rank DESC
LIMIT 20;
=>
+---------------------------------+---------+
|name |rank |
+---------------------------------+---------+
|Handcrafted Italian Pasta |6.3999996|
|Pasta - Penne Primavera, Single |1 |
|Pasta - Cappellini, Dry |1 |
|Pasta - Tortellini, Fresh |1 |
|Pasta - Fusili, Dry |1 |
|Pasta - Penne, Rigate, Dry |1 |
|Pasta - Fusili Tri - Coloured |1 |
|Pasta - Rotini, Colour, Dry |1 |
|Pasta - Shells, Medium, Dry |1 |
|Pasta - Cannelloni, Sheets, Fresh|1 |
|Pasta - Orzo, Dry |1 |
|Pasta - Lasagna, Dry |1 |
|Pasta - Fettuccine, Egg, Fresh |1 |
|Pasta - Lasagna Noodle, Frozen |1 |
|Pasta - Linguini, Dry |1 |
+---------------------------------+---------+
This query sets weight label A for the name and label B for the description. It also sets actual numbers for those labels. The results are the same, but the ranking is different; you can see how it has changed. The current dataset just doesn't show big differences. It's important to have the right data for the search to work correctly.
Let's change the dataset. I found this dataset of products (bottles, nothing better, sorry) on the good old internet (without AI), at data.world.
SELECT name,
category,
ts_rank_cd(
'{0.1, 0.2, 0.9, 1.0}',
setweight(to_tsvector('english', name), 'A') ||
setweight(to_tsvector('english', category), 'B'),
to_tsquery('english', 'decanters')
) AS rank
FROM bottle_products
WHERE setweight(to_tsvector('english', name), 'A') @@ to_tsquery('english', 'decanters')
OR setweight(to_tsvector('english', category), 'B') @@ to_tsquery('english', 'decanters')
ORDER BY rank DESC
LIMIT 20;
=>
+-------------------------------------------+------------------------------+----+
|name |category |rank|
+-------------------------------------------+------------------------------+----+
|Giori Casino Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Evan Williams Master Select Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Giori Two Dolphins Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Courvoisier Erte Vigne Cognac Decanter 1993|DECANTERS & SPECIALTY PACKAGES|1.9 |
|Giori Octopus Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Giori Butterfly Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Courvoisier Erte Decanter 1994 |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Giori Unicorn Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Drambuie Liqueur W/empty Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Jack Daniels 1954 Gold Medal Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Jack Daniels 1914 Gold Medal Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Jack Daniels 1913 Gold Medal Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Giori Golf Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Jim Beam Black Holiday Decanter |DECANTERS & SPECIALTY PACKAGES|1.9 |
|Jack Daniels 1904 Gold Medal Decanter |TENNESSEE WHISKIES |1 |
|Jack Daniels 1905 Gold Medal Decanter |TENNESSEE WHISKIES |1 |
|Jack Daniels 150th Anniversary Decanter |TENNESSEE WHISKIES |1 |
|Cutty Sark W/decanter |DECANTERS & SPECIALTY PACKAGES|0.9 |
|Cutty Sark Scotch W/ceramic Tumblers |DECANTERS & SPECIALTY PACKAGES|0.9 |
|Glenmorangie 10yo W/glass |DECANTERS & SPECIALTY PACKAGES|0.9 |
+-------------------------------------------+------------------------------+----+
Nothing big, but... So we searched for the word decanters. You can see a better ranking for products containing the word decanter in both the name and the category. Thanks to this we see all products in this order:
- with name with search term and products from relevant category,
- with name with search term from other categories,
- other products from the relevant category.
Let’s see the Elixir code for this:
text = "decanters"
weights = "{0.1, 0.2, 0.9, 1.0}"
language = "english"
limit = 20
offset = 0
from(bp in BottleProducts,
  where:
    fragment(
      "setweight(to_tsvector(?::text::regconfig, ?), 'A') @@ to_tsquery(?::text::regconfig, ?)",
      ^language,
      bp.name,
      ^language,
      ^text
    ) or
      fragment(
        "setweight(to_tsvector(?::text::regconfig, ?), 'B') @@ to_tsquery(?::text::regconfig, ?)",
        ^language,
        bp.category,
        ^language,
        ^text
      ),
  limit: ^limit,
  offset: ^offset,
  order_by: [
    desc:
      fragment(
        "ts_rank_cd(?::float4[], setweight(to_tsvector(?::text::regconfig, ?), 'A') || setweight(to_tsvector(?::text::regconfig, ?), 'B'), to_tsquery(?::text::regconfig, ?))",
        ^weights,
        ^language,
        bp.name,
        ^language,
        bp.category,
        ^language,
        ^text
      )
  ]
)
|> Repo.all()
It's a pretty long query and basically all the logic is in fragments of raw SQL. There is no other way to do it. You can write macros to make it look much nicer if you use it in multiple places. Maybe there are libraries for Ecto, but basically they will just be wrappers around fragment, as in our case.
This type of query does a lot of computation and it’s possible to optimise it with a better query or simply by creating indexes. For example:
CREATE INDEX idx_bottle_products_name_tsvector ON bottle_products
USING gin(to_tsvector('english', name));
If you search with only one column:
SELECT name,
category,
ts_rank_cd(
to_tsvector('english', name),
to_tsquery('english', 'decanters')
) AS rank
FROM bottle_products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'decanters')
ORDER BY rank DESC
LIMIT 20;
I got results in 50 to 80 ms without the index and 20 to 30 ms with it. It's important to say that you create an index for a specific language. So if you have records in multiple languages, you need to create an index for each language. You can have multiple indexes on the same column; you just add a condition at the end. For example:
CREATE INDEX idx_bottle_products_name_english_tsvector ON bottle_products
USING gin(to_tsvector('english', name)) where language_id = 1;
CREATE INDEX idx_bottle_products_name_german_tsvector ON bottle_products
USING gin(to_tsvector('german', name)) where language_id = 2;
When you search with these ts vectors across multiple columns, it can be a bit difficult to make a proper, fast query. Personally, I think similarity shows better results than ts vectors, so I haven't gone deeper. But again, it depends heavily on your data; it may be much better in your case.
In real life customers, visitors and users don't search with just one word. Usually two, or sometimes even more.
SELECT name,
category,
ts_rank_cd(
to_tsvector('english', name),
to_tsquery('english', 'decanters bottle')
) AS rank
FROM bottle_products
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'decanters bottle')
ORDER BY rank DESC
LIMIT 20;
=>
ERROR: syntax error in tsquery: "decanters bottle"
That's because to_tsquery expects a single word or a valid query expression. This can get a bit wild. You have several options, and you can see a lot of examples in the documentation:
tsquery operators: & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY)
Let’s look at two basic examples:
to_tsquery('english', 'decanters | bottle')
=> decanters OR bottle
to_tsquery('english', 'decanters & bottle')
=> decanters AND bottle
To move this problem into real code, let's look at the updated function for building the query:
text = "decanters bottle" |> String.split(" ") |> Enum.join(" & ")
weights = "{0.1, 0.2, 0.9, 1.0}"
language = "english"
limit = 20
offset = 0
from(bp in BottleProducts,
  where:
    fragment(
      "setweight(to_tsvector(?::text::regconfig, ?), 'A') @@ to_tsquery(?::text::regconfig, ?)",
      ^language,
      bp.name,
      ^language,
      ^text
    ) or
      fragment(
        "setweight(to_tsvector(?::text::regconfig, ?), 'B') @@ to_tsquery(?::text::regconfig, ?)",
        ^language,
        bp.category,
        ^language,
        ^text
      ),
  limit: ^limit,
  offset: ^offset,
  order_by: [
    desc:
      fragment(
        "ts_rank_cd(?::float4[], setweight(to_tsvector(?::text::regconfig, ?), 'A') || setweight(to_tsvector(?::text::regconfig, ?), 'B'), to_tsquery(?::text::regconfig, ?))",
        ^weights,
        ^language,
        bp.name,
        ^language,
        bp.category,
        ^language,
        ^text
      )
  ]
)
|> Repo.all()
That’s a very small change:
text = "decanters bottle" |> String.split(" ") |> Enum.join(" & ")
But it's not good, because:
- the input can vary and can lead to an invalid query
- it assumes we only ever want the & operator
Let's see the first problem (note the repeated spaces in the input):
"decanters   bottle" |> String.split(" ") |> Enum.join(" & ")
"decanters &  &  & bottle"
How can we fix it? A production solution could handle more edge cases, but here is a very simple one:
"decanters |bottle & whisky "
|> String.trim()
|> String.downcase()
|> String.replace("&", "")
|> String.replace("|", "")
|> String.replace(~r/\s+/, " ")
|> String.split(" ")
|> Enum.join(" & ")
"decanters & bottle & whisky"
The point is to normalise the input, remove the characters the query syntax uses, and then build the actual query from the parsed words. It's very simple and could probably be done with one regex or in another way; it's up to you to make it work with your logic and skills.
It's also worth mentioning why I added String.downcase. It's not strictly necessary, but it's nice to use a common format, and if you cache search results it can produce a lot more cache hits when visitors type something with capital letters.
The second problem is more complex: if the search term consists of several words, we assume an "and" operator, so each search result must contain all the words. Handling operators properly really depends on how the search is used. I'm not going to write full code for it, because it could fill a whole separate article, and there are plenty of articles about parsing queries and building such logic. A very simple version, which I suggest not using in production, could look like this:
"decanters or whisky"
|> String.trim()
|> String.downcase()
|> String.replace("&", "")
|> String.replace("|", "")
|> String.replace(~r/\s+/, " ")
|> String.split(" ")
|> Enum.join(" & ")
|> String.replace("& or &", "|")
"decanters | whisky"
It will only work with one kind of operator; entering multiple operators may create logical problems.
"decanters or whisky or glass and water"
|> String.trim()
|> String.downcase()
|> String.replace("&", "")
|> String.replace("|", "")
|> String.replace(~r/\s+/, " ")
|> String.split(" ")
|> Enum.join(" & ")
|> String.replace("& or &", "|")
|> String.replace("& and &", "&")
"decanters | whisky | glass & water"
You may still hit the problem of products with similar names but very different meanings, for example chocolate and a plate in chocolate colour. Similarity search will find the shared parts and assume the items are similar, and ts vectors will too. Maybe the plate will rank lower, but it will still show up when the customer wants to buy chocolate as a gift.
Semantic Search
The big player in the game: semantic search. Sounds modern, sounds like something with AI. Well, we can simplify it, or define it, as "using the meaning of words to search". People, especially developers, realised a very long time ago that search deals with human languages, which are much more complex and not as precise as programming languages. You can read more about the history of "meaningful search" in this nice article from Algolia (a really good 3rd party search provider):
In the article you can read about AI search in 2007 and vector search in 2013. So it has nothing to do with the current (2024) hype about AI. These technologies are so old that they could already walk to school alone without their parents. But they still need to be guided and shown the right way to live.
It's important to mention that semantic search is a slightly broader concept, not an exact recipe. The main goal is to use tools that take the meaning behind the search term and execute the search with an understanding of content and context, returning relevant data.
For this purpose I will show vector search. PostgreSQL is wonderful for using real data with vectors thanks to the pgvector extension.
The definition of what a vector, or vector search, is can be quite complex, and I'm not the right guy to explain it properly. I will give you the very simple definition I use to explain it to my non-technical colleagues.
First of all, the following visualisation of vectors is really good. I borrowed this image from the already mentioned Algolia article, and they borrowed it from this article:
In this picture you can see words and the connections between them, in 3-dimensional space. Each dimension has its own meaning: it can be the distance from the North Pole, or 0 can be black and 1 white with all other colours in between. It really depends on the author of the "space". These spaces are called embedding models. I explain it to non-technical people as GPS coordinates in a fake world, where vectors are several numbers similar to latitude and longitude in our world, and the model is the map. So basically, vectors are coordinates on a map. Yes, it's not correct, and my university teachers would probably try to physically attack me for that definition, but it is what it is. I'm a practical guy.
A nice example is a model from OpenAI. Their model text-embedding-3-large has 3072 dimensions. This means that if you put some text into this model, it returns an array/list of 3072 numbers. These numbers are basically coordinates in the world named "text-embedding-3-large". They have other models too; you can read more about how this model was created, with some comparisons, in their article.
So let’s move on to the practical and technical. Using vectors for search consists of two parts:
- Obtaining and storing Vectors
- Calculations with Vectors
Obtaining and storing Vectors
First you need to know your data, because getting the right vectors is very important, and it's different for articles versus products, or small text versus big text. The problem with big text, like a book or an article, is that it can contain many meanings. One part of an article can be about a bad experience with A and a second part about a good experience with B, so the embedding model will have a hard time deciding which vectors to use, i.e. where the position is on every dimension. Or a product description may also contain a tutorial on how to cook with the product; because of that, the model may decide that the content is about the meal, not the product.
Context
For articles, books, documentation, or basically any large text, it's common practice to split the text into several blocks and vectorise only those blocks. You will get better vectors thanks to narrower possible meanings, and you can also point results at finer detail. For example, instead of showing that your topic is mentioned in book A, you can show that it's mentioned in book A, chapter 23, paragraph 5.
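A naive chunking sketch in Elixir, splitting plain text on blank lines and keeping the block position so a result can point back to the exact block (the splitting strategy is an assumption; real chunkers are smarter):

```elixir
# Split a large text into paragraph blocks and remember their position,
# so a search hit can say "block 5" instead of "somewhere in the text".
chunks =
  large_text
  |> String.split(~r/\n{2,}/, trim: true)
  |> Enum.map(&String.trim/1)
  |> Enum.with_index(1)
  |> Enum.map(fn {block, position} -> %{position: position, text: block} end)
```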
If you have a product with a description, you need to make sure it doesn't contain text that isn't directly related to the product itself: how to cook with it, the history of the company that made it, and so on. This matters because models (as of 2024) cannot reliably decide on their own which context is the main, important one and which is irrelevant and should be ignored. It's possible, but you have to take the right steps first. For example, you can take only a few sentences from the description, use a summarised description if you have one, or use other AI models to generate the right description from your long one. The point is to focus on the right context.
Sometimes it's not about removing text; it's about adding detail. Imagine you are an online shop for coffee. You have the names of all products, brands, etc., and a good product search. But you also want to add a search for brands, so you want to use the brand name and description, maybe the location. Great, that will work, but it can work better. Some brands don't just sell coffee; they also sell equipment for making coffee or other ingredients, or specialise in particular categories of coffee. So, for example, adding the top 5 categories of products that a brand sells in your store helps create a better context for the model to decide on the right vectors.
Important note: do not forget to remove HTML or other tags from your text. They're not needed to get proper vectors and can waste some of the potential.
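A crude way to strip tags before vectorising (for real HTML an actual parser such as Floki is safer, but as a sketch):

```elixir
"<p>Dark roasted <strong>beans</strong></p>"
|> String.replace(~r/<[^>]+>/, " ")  # drop anything that looks like a tag
|> String.replace(~r/\s+/, " ")      # collapse leftover whitespace
|> String.trim()
# => "Dark roasted beans"
```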
Even more context
So we have proper data: big text is broken down into small blocks, products contain only directly relevant data, brands contain top categories. Great, are we done with context? NO!
Imagine that you just met some random friend and you tell that person:
- Dark Roasted Peru, Bean Lovers, sweet taste from South America
- Serrano Superior, Ebenica, exotic taste from Cuba
Well, it can go several ways, but the person probably won't understand what you mean, and the feedback will vary. So let's give some context:
- I want to buy coffee Dark Roasted Peru from the brand Bean Lovers; it should have a sweet taste and it was grown in South America
- I want to buy coffee Serrano Superior by the brand Ebenica; it should have an exotic taste from Cuba.
I don't think that needs further explanation. It's similar with machines: you have to give extra context to tell the machine that we are talking about products from the store, which part of the text is the brand, which is the description, and so on.
Let’s see some example:
"A #{product_category} product named #{product_name} from brand #{product_brand}. Description: #{product_description}"
"A brand named #{brand_name} selling #{top_categories} from #{brand_location}. Description: #{brand_description}"
If you have data in a different language, or in multiple languages, you should mention that in this "prompt", and the text itself can be in that language.
Ok, so basically we took very simple stuff and made it a little bit more complex. And I have bad news: I'm not done with context yet.
Just as you add context to your data, you need to add context to the search terms, for the same reasons I described for products and brands. Here is an example:
"Search for #{term_string} in coffee products or related equipment. Include related terms, brands, categories or similar products "
Ok, we have prepared the text to get vectors; now we need to actually get them. You can use local offline models or try some 3rd party API. I use OpenAI; it's easy.
@spec text_to_vector(String.t() | list(String.t())) ::
{:ok, list(float()) | list(list(float()))} | {:error, any()}
def text_to_vector(text) do
case OpenAI.embeddings(model: "text-embedding-3-large", input: text, dimensions: 2000) do
{:ok, %{data: [%{"embedding" => e}]}} ->
{:ok, e}
{:ok, %{data: embeddings}} ->
{:ok, Enum.map(embeddings, fn %{"embedding" => e} -> e end)}
{:error, reason} ->
{:error, reason}
end
end
It’s a simple use of the API, you can send a text or multiple texts and you will get vectors for them. I’m using this OpenAI library https://hex.pm/packages/openai
You don’t have to use a paid 3rd party service. You can for example use the great Elixir library Bumblebee and use some models from the wonderful Hugging Face. See code example in docs. It may look complicated, but basically it’s copy&paste and it really works.
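For illustration, a rough sketch of what the Bumblebee route can look like, based on its documentation. The model name and options here are assumptions, you need the bumblebee, nx and exla packages in your deps, and the exact code in the docs may differ:

```elixir
# Sketch only: local embeddings with Bumblebee and a Hugging Face model.
# The model name below is an assumption; check the Bumblebee docs for details.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.text_embedding(model_info, tokenizer)

# Returns a map with an Nx tensor under :embedding.
%{embedding: embedding} = Nx.Serving.run(serving, "Dark Roasted Peru coffee")
vector = Nx.to_flat_list(embedding)
```

Note that a local model dictates the number of dimensions, which matters for the column size discussed below.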
Storage
Now we have vectors, what now? First you have to decide, based on your usage, whether you want a separate table storing, for example, product_id and the vector itself, or whether you will store them in the same table as the rest of the data.
To store vectors in PostgreSQL, don’t forget to have the pgvector extension installed and enabled. They also have their own Docker image, so if you use Docker for your PostgreSQL (yes, many people do) you can use it. More importantly, if you run tests in your CI/CD, you can use this Docker image, for example in a GitHub Actions job.
services:
postgres:
image: pgvector/pgvector:pg16
So it’s easy to enable this extension if it’s available/installed on your server:
CREATE EXTENSION IF NOT EXISTS vector;
or Ecto migration version:
execute "CREATE EXTENSION IF NOT EXISTS vector;"
In order for your Elixir application to work with Ecto and vectors, you need a new attribute type. You can create it yourself or use the pgvector hex package. Thanks to the great work in this package you get the type and also query macros for nicer queries, but we will talk about that later.
If you want to create a new table or add a new column, it’s easy with an Ecto migration. You can also do it in whatever language and framework you use.
add :vector, :vector, size: 2_000
The first “vector” is the name of your field; you can use whatever you want. The second is the new type provided by the hex package and PostgreSQL extension. Size is the number of dimensions of this vector, and it’s important: it has to match the output size of your model. My example uses OpenAI’s API, which allows you to specify the number of dimensions in the output. If your model doesn’t let you change it that easily, there are code examples on the internet showing how to reduce dimensions yourself; it’s outside my current “vector knowledge”, so I’m not going to play with it.
I could use the default dimension size, but the current PostgreSQL extension only supports indexing vectors with up to 2000 dimensions. See more details in the Index section of their GitHub page.
Creating an index on this field is important for me. It makes the query much faster, probably more because of the large number of dimensions than the number of entries in the table. There are several types of indexes; read about them and choose what you think is best for your data, or just try them. My winner is HNSW.
Creating an index is easy:
CREATE INDEX CONCURRENTLY your_table_vectors_vector_idx
ON your_table_vectors
USING hnsw (vector vector_ip_ops);
But there is a very important part: vector_ip_ops. It’s the name of the distance function you want to use in queries. I use OpenAI’s model, and their documentation suggests using the inner product distance function instead of cosine distance. So based on your data, you have to decide which function to use for your queries. It’s easy to change, and you can even play around with it later. But don’t forget that if you have a lot of data, indexing can take a long time.
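If you manage your schema with Ecto migrations, the same index can be created there. A sketch, assuming pgvector is already enabled; the migration module name is mine and the table is this article’s example. CREATE INDEX CONCURRENTLY cannot run inside a transaction, hence the two module attributes:

```elixir
defmodule MyApp.Repo.Migrations.AddProductVectorsHnswIndex do
  use Ecto.Migration

  # CONCURRENTLY is not allowed inside a transaction,
  # so disable Ecto's DDL transaction and migration lock.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def up do
    execute """
    CREATE INDEX CONCURRENTLY your_table_vectors_vector_idx
    ON your_table_vectors
    USING hnsw (vector vector_ip_ops);
    """
  end

  def down do
    execute "DROP INDEX CONCURRENTLY your_table_vectors_vector_idx;"
  end
end
```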
If you really have a lot of data, you should read this article about the new pgvectorscale extension from the Timescale team.
One important note when you store vector in a table with other data. When you use Elixir Ecto schema, you really don’t want to load vector every time when you work with that product/schema. It can be a problem because imagine that you have products and every time when you show product on index page or detail page, your system will load all vectors for every product. You really don’t want that. Ecto has an incredible solution for it. Just add this option to your schema definition and it’s done.
field :vector, Pgvector.Ecto.Vector, load_in_query: false
It will ignore this field when Ecto loads the record from the database; you have to explicitly ask for it. Thanks to this, you don’t have to worry about downloading and parsing all the vectors during regular operations. You may not use these vectors in your application at all, only in the database. I do it that way: the application only creates them and stores them in the database. This can and should be done asynchronously. Some services like Supabase, or some extensions, support it directly from the database: you just put the entry in the database and it calls the relevant API and stores the vector for you. Is it good? Hard to say, I don’t like the idea of having that kind of logic in the database, but it might be great for someone. Timescale has published a big tutorial on how to use these tools.
The output of the model, the vector, is basically a list of numbers. So no further steps are required. Just store it in your database as returned by the model’s api.
Calculations with Vectors
Well, I say calculations because under the hood it really is calculations, but what we are actually doing is searching.
First, some important technical notes. If you use an index like HNSW on your vector column, there is something important to know: by default, the index returns only the first 40 results. You can read about this on pgvector’s GitHub. To change it, you need to adjust the configuration of your database or the whole server, depending on your PostgreSQL setup. You can set it for your current database:
ALTER DATABASE your_database SET hnsw.ef_search TO 750;
Or you can add it to the configuration for all databases, as I did locally for the PostgreSQL installed by Homebrew.
# /opt/homebrew/var/postgresql@16/postgresql.conf
hnsw.ef_search = 750
I have a configuration for 750 results. It’s a lot, but I have a reason for it. The usual approach is a single query that returns your products (movies, names or whatever entries you have), and in that case you don’t need 750 results. But if you need more complex filters, the query can get really big and complex. Complexity was one of the main reasons why I went with the multiple queries approach:
- Query to check the existence of a search term in our database
- Query that returns search results with ONLY ids
- Query that returns actual results which are used for visitors
So you can see in step number 2 that the database query returns only ids, and later my business logic does the filtering using these results. So even if I need only 10 products/results, my business logic, filters and sorters get to work with many more results than 10, because maybe the first 100 of those results are not suitable. Imagine category filters, price filters, tag filters, delivery locations, etc. Based on some tests, 750 was a fine number for me. I used a technique where I searched with many more results and checked the ones at the bottom. When they didn’t make sense, I cut that number by 50%, searched again and checked the bottom again; if the last results made sense, I added 50% back, and so on.
Query to check existence of a search term in our database
First of all, remember that this type of search doesn’t use the pure input from the HTML input field for the actual search. It uses this input as text which is transformed into vectors. The transformation can be done locally or with a 3rd party API, but either way it is slow. More importantly, customers don’t search for unique search terms; it’s common for them to search for the same things quite often. So why create vectors for the same search every time, even if multiple visitors search for the same thing? That’s the reason why I store the search term together with its vector. The first search is “slow” because it has to get the proper vector, thanks to the nice prompt I mentioned before, and store it in the database. After that I don’t load this vector into the application at all; I use a search term query which returns the relevant vector as a subquery.
But the point is that I have to check the database to see if anyone has already searched for this search term. If not, I have to get the vector from the model and store it in the system.
SELECT TRUE FROM "search_terms" AS s0 WHERE (s0."term" = 'banana') LIMIT 1
from(st in SearchTerm, where: st.term == ^string)
|> Repo.exists?()
To be honest, I don’t use this query. I use a slightly more complex one that I use in several places.
from(st in SearchTerm,
where: st.term == ^string,
limit: 1,
select: st.vector
)
If you call it with Repo.one(term_query), it will give you the vector, but you don’t need it here. When you call it with Repo.exists?(term_query), Ecto removes the select part of the query and adds TRUE, as you can see in the SQL example above. But I also use this same query as a subquery for the search itself in the next step.
It’s important to say that the search term, the string from the HTML input, must be normalised. It will save you some pain in the future. For example, a very simple version:
string
|> String.trim()
|> String.downcase()
|> String.replace(~r/\s+/, " ")
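The pipeline above can be wrapped in a small function so that every entry point normalises terms the same way. The module and function names here are my own placeholders, not part of any library:

```elixir
defmodule SearchTermNormalizer do
  # Normalises raw input from the HTML search field:
  # trims, lowercases, and collapses runs of whitespace to one space.
  def normalize(string) do
    string
    |> String.trim()
    |> String.downcase()
    |> String.replace(~r/\s+/, " ")
  end
end

SearchTermNormalizer.normalize("  Dark   ROASTED Peru ")
# => "dark roasted peru"
```

Normalising in one place matters because the stored search terms act as a cache key for the vectors; two differently cased or spaced inputs should hit the same stored vector.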
When I know that I have the normalised search term in the database with the correct vector, I can proceed to the next step. This step is a compromise: it costs us a few ms per query, but thanks to it we can be sure that our search doesn’t (falsely) return zero results and that we don’t call a 3rd party API for every search.
Query that returns search results with ONLY ids
Well, the headliner of our party: the main query that performs the actual search. To be honest, it’s boringly simple. First you have to choose which distance function to use. Let’s use “inner product distance”, which is suggested for my embeddings/vectors model (OpenAI).
So first of all, from the previous step we have the search term check and the term query. We’re going to use the term query as a subquery.
from(
s in subquery(
from(pv in ProductVector,
as: :product_vector,
order_by: [asc: selected_as(:distance)],
select: %{
product_id: pv.product_id,
distance:
selected_as(
fragment("(? <#> ?)", pv.vector, subquery(term_query)),
:distance
),
rank:
selected_as(
1 / (60 + fragment("(? <#> ?)", pv.vector, subquery(term_query))),
:rank
)
},
limit: 350
)
),
as: :search,
where: s.distance < -0.4,
select: %{product_id: s.product_id, rank: s.rank, distance: s.distance}
)
There is an interesting Ecto function here, selected_as. Thanks to it, it’s possible to use a value from the select part in order_by, for example. That’s common in SQL but not easy at all in Ecto or a functional language. More about it in the docs.
SQL version:
SELECT s0."product_id", s0."rank", s0."distance"
FROM (SELECT sp0."product_id" AS "product_id",
(sp0."vector" <#> (SELECT sss0."vector" AS "vector"
FROM "search_terms" AS sss0
WHERE (sss0."term" = 'chocolate')
LIMIT 1)) AS "distance",
1 / (60 + (sp0."vector" <#> (SELECT sss0."vector" AS "vector"
FROM "search_terms" AS sss0
WHERE (sss0."term" = 'chocolate')
LIMIT 1))) AS "rank"
FROM "product_vectors" AS sp0
ORDER BY "distance"
LIMIT 350) AS s0
WHERE (s0."distance" < -0.4)
It’s not that simple, you could say I’ve over-complicated the query. Yes, but wait for the explanation.
The index on the vector column is only used if the limit is used.
To avoid getting very weird results at the end, we need to filter the results by distance. The subquery version, where I filter out results with a distance greater than -0.4 outside the subquery, was faster: when I put the filter inside, the index was not used.
The vectors are very large: 2 000 numbers in a list. So downloading a vector to your application and immediately uploading it back as part of a query is a waste of resources. Instead, you can use a query that resolves the vector directly in the database within your main query. That’s the reason for the subquery with term_query.
You can use the macro max_inner_product(column, vector) from Elixir’s pgvector package instead of the fragment. It may look nicer that way, but in the end it’s still a fragment, and I could not get it to work with a subquery instead of a vector.
The calculated distance is a negative number; its range is still between 0 and -1. These distances can vary a lot, and it really depends on your data, the model you are using and the prompt you use to get the vectors. I got a big difference between distances on the same data, with the same query, when I changed the prompt that generates vectors from OpenAI’s model. Before the change I had numbers like -0.2345 and afterwards quite different ones, around -0.1292949494949 (just example numbers).
Using the same approach as when I found 750 to be a good number for my data, I found out that I can remove all results with a distance above -0.4. With the inner product distance, lower (more negative) means more similar: -0.9 is a great result and -0.1 is terrible.
I also calculated the rank. That’s because I want to use this query for hybrid search as well. It is described in the next chapter. There is a very simple calculation called Reciprocal Rank Fusion (RRF). Thanks to this approach you can merge results from different (but similar) ranges and compare them.
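For intuition, here is a minimal sketch of classic Reciprocal Rank Fusion, where each result list contributes 1 / (k + position) per id and the scores are summed. The queries in this article use a distance-based variant of the same idea (1 / (60 + distance)); the module name below is mine, not a library API:

```elixir
defmodule RRF do
  @k 60

  # Merge several ranked id lists into one list, best first.
  # Each list contributes 1 / (k + position) for every id it contains.
  def merge(result_lists) do
    result_lists
    |> Enum.flat_map(fn ids ->
      ids
      |> Enum.with_index(1)
      |> Enum.map(fn {id, position} -> {id, 1 / (@k + position)} end)
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
    |> Enum.map(fn {id, scores} -> {id, Enum.sum(scores)} end)
    |> Enum.sort_by(&elem(&1, 1), :desc)
    |> Enum.map(&elem(&1, 0))
  end
end

RRF.merge([[1, 2], [2, 3]])
# => [2, 1, 3]
```

Id 2 appears in both lists, so its summed score beats id 1, which only tops one list. The constant 60 dampens the effect of exact positions, which is what makes scores from different search techniques comparable.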
So this query returns sorted results containing product_id, distance and rank. You don’t need to export distance or rank in production, but it’s great in development to have all the pieces configured correctly, or later when you want to merge results with other queries.
Basically that’s all; the output with ids needs to be used in your search and filter logic. Here is a very simple example:
ids = VectorSearch.search(term_string)
from(p in Product,
where: p.id in ^ids,
order_by: fragment("array_position(?, ?)", ^ids, p.id)
)
|> Repo.all()
SELECT *
FROM products p
WHERE p.id = ANY ('{1,2,3,4,5}')
ORDER BY array_position('{1,2,3,4,5}', p.id)
The VectorSearch.search(term_string) function returns a sorted list of product ids from the semantic search. This list of ids is used to filter only the found products. The last step is the most important one: the results are sorted by the position of the product id in the list from the semantic search, thanks to the super nice function array_position. It’s a very simple query to show how it works; you can add more complex filters. In the project I mentioned earlier, we changed the similarity search to semantic search. This means we split one query into two: the first returns results with product ids and the second does the filtering and other stuff. Basically we changed one line of the query, from calling the similarity function to filtering products by id and sorting them by array position. I didn’t expect it, but the old big second query got much faster thanks to the product id filtering, which is properly indexed. So we added two more queries, but the actual response time of the search is now almost the same, and in some cases even faster than before vectors. Another side effect of this change is that you can change the logic in the future with minimal effort. You can do the search in a 3rd party service without implementing complex filtering in their system: you just do a search that returns product ids, and the complex filtering stays in your PostgreSQL database without any changes.
NOTE: You can see that I haven’t added more sql examples and data examples in this part of the article. The reason is that I don’t have a good database (lots of real data) that I can share publicly. If I find that kind of data, I will update this article later.
Hybrid search
I have to mention that I am pretty sure that in a couple of years, the term or concept of “Hybrid search” will be renamed or merged with Semantic Search, and nobody will use pure vector search on its own, because hybrid search basically does what the word “semantic” means according to English dictionaries.
I would like to mention that there is a nice Elixir Hex package for Hybrid Search in PostgreSQL vecto by cpursley (Chase Pursley).
Semantic search combined with “fulltext” search. Again, there is no exact definition that forces you to use a particular approach: you can use ilike, similarity or ts_vector to make it work. But the logic is the same: using vectors along with a more precise technique. I see it as a very good practice that helps you solve problems with the black box vectors generated by some model. If you have your own process to generate proper vectors for your data, you may not need this hybrid combination.
Basically, we need to combine one of the “fulltext” approaches from the previous chapters with semantic vector search into one big query. Sounds simple, and it is, until you want to order the results properly.
There is a problem, which I have already mentioned: almost all the approaches above have their own way of sorting. That’s why I introduced RRF in the semantic search query. Thanks to it, it’s possible to unify the ranking of the results on one scale, so we can merge the results and their ranks into one result table and sort it. So you can have mixed results.
The query is much bigger and can feel difficult to read, but if you take it step by step it’s easy.
similarity_query =
from(p in Product,
where: fragment("SIMILARITY(?, ?)", p.name, ^string) > 0.17,
order_by: [desc: selected_as(:rank)],
limit: 350,
select: %{
product_id: p.id,
rank:
selected_as(fragment("(1/ (60 - SIMILARITY(?, ?)))", p.name, ^"%#{string}%"), :rank),
distance: selected_as(fragment("SIMILARITY(?, ?)", p.name, ^"%#{string}%"), :distance)
}
)
term_query =
from(st in SearchTerm,
where: st.term == ^string,
limit: 1,
select: st.vector
)
vector_query =
from(
s in subquery(
from(psv in ProductVector,
as: :product_vector,
order_by: [asc: selected_as(:distance)],
select: %{
product_id: psv.product_id,
distance:
selected_as(
fragment("(? <#> ?)", psv.vector, subquery(term_query)),
:distance
),
rank:
selected_as(
1 / (60 + fragment("(? <#> ?)", psv.vector, subquery(term_query))),
:rank
)
},
limit: 350
)
),
as: :search,
where: s.distance < ^maximal_inner_product_distance,
select: %{product_id: s.product_id, rank: s.rank, distance: s.distance}
)
from(s in subquery(union_all(similarity_query, ^vector_query)),
order_by: [desc: s.rank],
select: {s.product_id, s.rank}
)
|> Repo.all()
|> Enum.uniq_by(&elem(&1, 0))
Before we look at the SQL version, let’s take a deep dive. We have two main queries:
- similarity query
- vector query
Both have the same result structure:
- product_id
- distance
- rank
Distance is basically the number returned by the specific function used for a given search. As mentioned before, it can be in different ranges, and the same number doesn’t mean the same thing in the context of both search results.
Rank is very important: thanks to the RRF calculation we can compare and sort both result sets together. This calculation minimises the differences between the numbers, so they are not equal, but similar.
Once both queries produce a rank on the same scale, merging them is simple thanks to UNION.
SELECT s0."product_id", s0."rank"
FROM (SELECT sp0."id" AS "product_id",
(1 / (60 - SIMILARITY(sp0."name", '%chocolate%'))) AS "rank",
SIMILARITY(sp0."name", '%chocolate%') AS "distance"
FROM "products" AS sp0
WHERE (SIMILARITY(sp0."name", 'chocolate') > 0.17::float)
UNION ALL
(SELECT ss0."product_id", ss0."rank", ss0."distance"
FROM (SELECT ssp0."product_id" AS "product_id",
(ssp0."vector" <#> (SELECT ssss0."vector" AS "vector"
FROM "search_terms" AS ssss0
WHERE (ssss0."term" = 'chocolate')
LIMIT 1)) AS "distance",
1 / (60 + (ssp0."vector" <#> (SELECT ssss0."vector" AS "vector"
FROM "search_terms" AS ssss0
WHERE (ssss0."term" = 'chocolate')
LIMIT 1))) AS "rank"
FROM "product_vectors" AS ssp0
ORDER BY "distance"
LIMIT 350) AS ss0
WHERE (ss0."distance" < -0.4))
ORDER BY "rank" DESC
LIMIT 350) AS s0
ORDER BY s0."rank" DESC;
The results can look like this:
+----------+--------------------+
|product_id|rank |
+----------+--------------------+
|11833 |0.01694915254237288 |
|4394 |0.016867469884363277|
|892 |0.016867469884363277|
|6552 |0.01685393258991333 |
|2345 |0.01685393258991333 |
|9035 |0.01685393258991333 |
+----------+--------------------+
All these queries are fast in PostgreSQL thanks to indexes and the approaches above. But still, don’t forget that you have to call a 3rd party API, which can be down or take too long to return the vector for the search term; I have already written about this. Depending on your use case, it may even be possible to store product ids for a stored search term. Basically, when a product is updated you need to update the vectors for that product, and in that case you need to expire the cache, so maybe you can just remove the stored product_ids for that search term. It all depends on the data you have and the code you run.
3rd party services
If you can and want to use 3rd party services for search, there are a few options I would like to mention that I consider good solutions.
Supabase
It’s an open source project that can be self-hosted or used as a paid cloud solution. It’s a whole ecosystem built on PostgreSQL with a lot of features. They also have some solutions for Hybrid Search.
Meilisearch
It’s another open source project with a cloud model similar to Supabase, but built on different technologies and focused directly on search. This model is based on a server that needs to be fed with data to be usable. There is a lot of development in modern search approaches there, and I really like it. For me, if I wanted to have “only” search, this would be the first option. They have made good progress recently with Hybrid search. It’s also important to mention their news about Federated Search, which may be a topic for a new article of mine.
ParadeDB
This is another open source solution that has recently been supported by some managed database providers, for example Tembo. Basically it’s PostgreSQL with some extensions and custom “indexing” logic. It’s new to me, but it looks really promising. Here is a nice article on how to use this database with Elixir and Ecto.
Timescale
I have mentioned this company several times. They really do a great job with PostgreSQL, open source extensions and modern AI requirements. They have a managed cloud service that supports their extensions, which can really improve search. For example, they have an extension pgai which automatically loads vectors from a selected API and stores them in the database, so you don’t need to have that logic in your application. To be honest, I don’t like that kind of logic in the database layer, but it might work for someone else. See the nice article about vectors from Timescale.
Final note
This article took me a little longer than I expected because I wanted to mention all the points and steps I had to take. It’s not perfect and there’s a lot of room for improvement. I will try to incorporate all relevant feedback I get, so don’t hesitate to contact me if you see a bug or have a suggestion for improvement.
Links
There are a lot of people who are smarter than me and have more experience or time for the topics I mentioned in this article. All these people write great articles focusing on more details. I will try to add links to these articles in the future to keep this as a fresh source of information.
Added 16.10.2024:
Thank you for reading and have a nice day
If you want to get in touch with me or stay informed about Elixir or web development news, you can find me on X/Twitter.