This post is older than a year. Some information might no longer be accurate.
Used: elasticsearch v7.1.1
In the previous article, we looked into prefix queries as a way to create suggestions based on existing data and enhance the search experience. We saw how quickly and simply they help in the beginning, but we also learned that they have drawbacks, like latency and duplicates, once the dataset grows larger over time. In this article, we are going to overcome these problems with the Edge NGram Tokenizer.
Stats for Nerds
- The most played song during writing: Los Angeles by The Midnight
- Time spent writing: 54 minutes
- Photo by Émile Perron on Unsplash
Edge NGram Tokenizer
This explanation is going to be dry.
The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
Source: Official reference
The default behaviour is:
With the default settings, the edge_ngram tokenizer treats the original text as a single token and produces N-grams with minimum length 1 and maximum length 2:
GET _analyze
{
  "tokenizer": "edge_ngram",
  "text": ["Los Angeles", "Love", "Paris", "Pain"]
}
These examples create the terms:
[L, Lo, P, Pa]
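The tokenizer's behaviour is easy to sketch outside of Elasticsearch. The following Python snippet is a simplified illustration (not Elasticsearch's actual implementation) that reproduces the default settings of min_gram 1 and max_gram 2 on the same inputs:

```python
def edge_ngrams(text, min_gram=1, max_gram=2):
    """Emit edge N-grams anchored to the start of the text,
    mimicking the edge_ngram tokenizer's default settings."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

terms = set()
for text in ["Los Angeles", "Love", "Paris", "Pain"]:
    terms.update(edge_ngrams(text))

print(sorted(terms))  # ['L', 'Lo', 'P', 'Pa']
```

Note how the whole text "Los Angeles" is treated as a single token, so only the N-grams "L" and "Lo" survive the default maximum length of 2.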
For autocompletion, this needs adjustment. Did you mean Love or Los Angeles when you typed Lo? So we have to increase the maximum length. The terms are also case sensitive, and most users don't start with capital letters, so we need to lowercase the terms.
Define Autocomplete Analyzer
Usually, Elasticsearch recommends using the same analyzer at index time and at search time. In the case of the edge_ngram tokenizer, the advice is different. It only makes sense to use the edge_ngram tokenizer at index time, to ensure that partial words are available for matching in the index. That's why Elasticsearch refers to it as the Index-Time Search-as-You-Type method.
We define the index wisdom to store quotes. The field quote has an index analyzer and a search analyzer.
PUT wisdom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
Let us analyze the quote by Damian Conway.
POST wisdom/_analyze
{
  "analyzer": "autocomplete",
  "text": "Documentation is a love letter that you write to your future self."
}
This results in these terms:
"do","doc","docu","docum","docume","documen","document",
"documenta","documentat","documentati","documentatio","documentation"
"is","lo","lov","love","le","let","lett","lette","letter",
"th","tha","that","yo","you","wr","wri","writ","write",
"to","yo","you","your","fu","fut","futu","futur","future","se","sel","self"
Now we index the quote.
PUT wisdom/_doc/112
{
  "quote": "Documentation is a love letter that you write to your future self."
}
Now we get the benefit of a simple match query with fuzziness.
GET wisdom/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "let luve wrote",
        "operator": "and",
        "fuzziness": 2
      }
    }
  }
}
We get our quote back:
{
  "hits" : [
    {
      "_index" : "wisdom",
      "_type" : "_doc",
      "_id" : "112",
      "_score" : 1.6925297,
      "_source" : {
        "quote" : "Documentation is a love letter that you write to your future self."
      }
    }
  ]
}
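Why do the misspelled terms luve and wrote still match? Fuzziness caps the edit distance between a query term and an indexed term. Elasticsearch uses Damerau-Levenshtein distance (which also counts transpositions); the plain Levenshtein sketch below is a simplification, but it is enough to show the idea:

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (insert/delete/substitute).
    A query term matches an indexed term if the distance stays
    within the configured fuzziness."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("luve", "love"))    # 1 -> within fuzziness 2, matches "love"
print(levenshtein("wrote", "write"))  # 1 -> within fuzziness 2, matches "write"
```

The term let needs no fuzziness at all, since the index already contains it as an edge N-gram of letter.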
Conclusion
This approach keeps queries fast even on large datasets, but it may slow down indexing and consume more disk space, since the inverted index has to store far more terms. We highly recommend reading the Definitive Guide, as it contains additional examples, e.g. for zip codes. With the proper setup, this method might satisfy your autocomplete needs.
Elasticsearch offers a third alternative: completion suggesters, which provide top-notch performance but require more memory.
When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.
Source: Edge NGram Tokenizer Reference
We are going to look into suggesters in the next article.