Reverse Engineering Elasticsearch Highlights

Published in The Startup

7 min read · Dec 7, 2020

Elasticsearch is a full-text search engine built on top of Lucene. It has some amazing features, including a built-in English language analyzer and a search term highlighter. Both of these features are incredibly useful, but some information is lost when you use them together, which makes it difficult to figure out why your query matched a document. Fortunately, I have come up with a method to recover this information (in most circumstances). This may be of interest to you if you want to see full-text search in action or if you are struggling with a similar problem.

Language Analyzers & Highlights

Elasticsearch is wonderful because its English language analyzer lets you find documents that match your query even if the text does not match exactly. If I am interested in finding documents about “new technologies”, Elasticsearch will also return documents that mention “new technology”. The simple stemming behind the scenes saves you a lot of time thinking up all the possible variations of a query you would otherwise need to find the content you want. The other wonderful thing about Elasticsearch is that you can get it to highlight the words that match your query string within the containing document. Just like you see on Google!

Highlighted conjugated search terms (in bold)

Highlighting is great because it allows users to see why the document matched their query. This is really important when the query string is very long and there are only partial matches. For example, a query string like fabulous new technologies emerge might return results as below:

Although it is very easy for a human to read the bolded text and figure out why the result matched their query, it’s not so easy to do algorithmically, since the word “technologies” does not exactly match “technology”. It would be nice if we could have some mapping between the keywords in our query and the bolded words in the snippet. By default Elasticsearch does not return such information. The response from Elasticsearch for the snippet above will be something like this:

Many see <em>emerging</em> <em>technologies</em> as a solution vector for the global challenges of the twenty-first century. ... Distribution of Micro-Nano <em>Technology</em>

So the first problem is that all the tagged words are wrapped with a generic <em> tag. And since the words may be conjugated (or processed in many other magical ways by Elasticsearch’s other analyzers), they will not necessarily match any of the keywords in our query. Thus we do not know which keyword maps to which <em>-tagged word.

Mapping the highlight to the query

Fortunately, Elasticsearch does provide some assistance in this matter via the tags_schema highlighter option (you need to be using the fast vector highlighter, FVH, to make this work). If you set this option to styled you will get additional information with your tags. You will see tags of the form:

<em class="hlt1">, <em class="hlt2">, <em class="hlt3">, ...

Unfortunately, it’s not at all clear from the documentation how these tags map back to the keywords in our query. I have found a few posts on Stack Overflow on the matter, but none have a clear answer. So, let’s learn by experimentation.
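For anyone who wants to follow along, here is a minimal sketch of the request body I am assuming for these experiments. The field name text and the helper name build_search_body are illustrative choices rather than anything Elasticsearch requires. The FVH highlighter also needs the field to be indexed with term vectors that include positions and offsets, so a matching index mapping is included.

```python
import json

def build_search_body(query_string, field="text"):
    """Build a search body asking for FVH highlights with the styled tag schema.

    The field name and this helper are assumptions made for illustration.
    """
    return {
        "query": {
            "query_string": {"query": query_string, "default_field": field}
        },
        "highlight": {
            "tags_schema": "styled",             # yields <em class="hlt1">, <em class="hlt2">, ...
            "fields": {field: {"type": "fvh"}},  # fast vector highlighter
        },
    }

# Index mapping: FVH needs term vectors with positions and offsets on the field.
mapping = {
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "english",
                "term_vector": "with_positions_offsets",
            }
        }
    }
}

print(json.dumps(build_search_body("fabulous new technologies emerge"), indent=2))
```

The printed JSON can be sent as the body of a search request against an index created with the mapping above.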

If we use the FVH highlighter with a query string:

fabulous new technologies emerge

and try to match it to the text:

Many see emerging technologies as a solution vector for the global challenges of the twenty-first century. ... Distribution of Micro-Nano Technology

we are going to get:

Many see <em class="hlt4">emerging</em> <em class="hlt3">technologies</em> as a solution vector for the global challenges of the twenty-first century. ... Distribution of Micro-Nano <em class="hlt3">Technology</em>

What the hell is going on here? Let’s take a closer look at the tagged text to understand this better. The three tagged words (in order of appearance) are:

emerging
technologies
Technology

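According to the docs we should expect the class of the tag to follow hlt1, hlt2, hlt3, and so on. These tags make sense if we look at the order of the keywords in our query string: fabulous is the first keyword (hlt1), new is the second (hlt2), technologies is the third (hlt3) and emerge is the fourth (hlt4).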

Mapping the highlight tags to the query string
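The rule observed above (the N in hltN is the 1-based position of the keyword in the query string) is easy to apply in code. The sketch below is my own illustration and only covers a plain whitespace-separated query string; the names HLT_RE and map_highlights are made up for this example.

```python
import re

# Matches <em class="hltN">...</em> as produced by tags_schema "styled".
HLT_RE = re.compile(r'<em class="hlt(\d+)">(.*?)</em>')

def map_highlights(keywords, snippet):
    """Pair each highlighted fragment with the query keyword its hlt index points at."""
    pairs = []
    for index, fragment in HLT_RE.findall(snippet):
        position = int(index) - 1  # hlt classes are 1-based
        keyword = keywords[position] if 0 <= position < len(keywords) else None
        pairs.append((fragment, keyword))
    return pairs

keywords = "fabulous new technologies emerge".split()
snippet = (
    'Many see <em class="hlt4">emerging</em> <em class="hlt3">technologies</em> '
    "as a solution vector for the global challenges of the twenty-first century. "
    '... Distribution of Micro-Nano <em class="hlt3">Technology</em>'
)
print(map_highlights(keywords, snippet))
```

For the snippet from this section it pairs emerging with emerge, and both technologies and Technology with technologies.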

Now it seems we have a way to trace the highlights back to our very basic query! But let’s not stop here. Elasticsearch provides many powerful tools to express your query in a more specific way. Let’s see how this works with features like exact matches, proximity searches, and boolean queries.

Things get a little bit more complicated if you are looking for exact matches of phrases, e.g. if you use quotation marks to specify an exact match. For example, let’s look at the query string:

fabulous "new technologies" emerge

and match the text:

The new technology has emerged!

resulting in:

The <hlt2>new technology</hlt2> has <hlt3>emerged</hlt3>!

I have abbreviated <em class="hlt1"> to <hlt1> for the sake of brevity.

We see that the quoted keyword "new technologies" has been treated as the 2nd phrase to tag, so it gets assigned hlt2. This time our query string maps to the following tags: fabulous to hlt1, "new technologies" to hlt2 and emerge to hlt3.

Mapping the highlight tags to a query with an exact match

This same rule will apply if we introduce a proximity search into our query string, such as in the following:

fabulous "new technologies"~4 emerge

and match this to the text:

The new hot technology has emerged!

resulting in:

The <hlt2>new</hlt2> hot <hlt2>technology</hlt2> has <hlt3>emerged</hlt3>!

So we see the tag order was preserved from the previous example.

Mapping the highlight tags to a proximity search
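To predict the tags for query strings like the two above, the query has to be split so that a quoted phrase (with its optional ~N proximity suffix) counts as a single keyword. Here is a rough sketch of that idea. It is not the query_string parser Elasticsearch uses, it ignores operators such as AND, OR and field prefixes, and the names TOKEN_RE and tags_for_query_string are hypothetical.

```python
import re

# A keyword is either a quoted phrase with an optional ~N suffix, or a bare word.
TOKEN_RE = re.compile(r'"[^"]*"(?:~\d+)?|\S+')

def tags_for_query_string(query_string):
    """Map each keyword of a simple query string to the hlt tag we expect for it."""
    tokens = TOKEN_RE.findall(query_string)
    return {token: f"hlt{i}" for i, token in enumerate(tokens, start=1)}

print(tags_for_query_string('fabulous "new technologies" emerge'))
print(tags_for_query_string('fabulous "new technologies"~4 emerge'))
```

In both cases the phrase comes out as the second keyword (hlt2) and emerge as the third (hlt3), which agrees with the highlights shown above.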

Highlights & Boolean Queries

The query is rarely a simple query string. A query may take the form of a nested boolean query, such as the following example:

{
  "query": {
    "bool": {
      "should": [
        {"match": {"text": "fabulous"}},
        {"match": {"text": "new"}}
      ],
      "must": [
        {"match": {"text": "technologies"}},
        {"match": {"text": "emerge"}}
      ]
    }
  }
}

If we match this query to the text:

The new hot technology has emerged!

It will produce the following highlight:

The <hlt4>new</hlt4> hot <hlt1>technology</hlt1> has <hlt2>emerged</hlt2>!

What the?? The order of the tags (hlt1, hlt2, hlt3, ...) is following an interesting pattern:

{
  "query": {
    "bool": {
      "should": [
        {"match": {"text": "fabulous"}},     [hlt3]
        {"match": {"text": "new"}}           [hlt4]
      ],
      "must": [
        {"match": {"text": "technologies"}}, [hlt1]
        {"match": {"text": "emerge"}}        [hlt2]
      ]
    }
  }
}

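We see here that the clauses in the must field are numbered first, starting from hlt1, and the tag index then picks up again with the clauses in the should field, regardless of the order in which the two fields appear in the query.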
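The ordering rule can be written down as a small tree walk: visit must clauses first, then should clauses, and number the keywords in the order they are visited. The sketch below is a simplified illustration and not my full parser. It only understands bool and single-term match clauses, and the depth-first handling of nested bool queries is an assumption extrapolated from the flat example above. The name assign_tags is made up for this example.

```python
def assign_tags(query):
    """Return (keyword, tag) pairs for a bool query made of match clauses.

    Assumes must clauses are numbered before should clauses, and that nested
    bool queries are numbered depth-first (an extrapolation, not verified).
    """
    keywords = []

    def visit(node):
        if "bool" in node:
            for occur in ("must", "should"):  # must first, then should
                clauses = node["bool"].get(occur, [])
                if isinstance(clauses, dict):  # a single clause may be given as an object
                    clauses = [clauses]
                for clause in clauses:
                    visit(clause)
        elif "match" in node:
            for value in node["match"].values():
                # A match value is either a plain string or {"query": "..."}.
                keywords.append(value["query"] if isinstance(value, dict) else value)

    visit(query)
    return [(keyword, f"hlt{i}") for i, keyword in enumerate(keywords, start=1)]

query = {
    "bool": {
        "should": [{"match": {"text": "fabulous"}}, {"match": {"text": "new"}}],
        "must": [{"match": {"text": "technologies"}}, {"match": {"text": "emerge"}}],
    }
}
print(assign_tags(query))
```

For the example query this gives technologies as hlt1, emerge as hlt2, fabulous as hlt3 and new as hlt4, the same numbering seen in the highlight.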

This gets slightly more complicated when you go to a nested boolean query, but the same rules apply. Using these rules, I’ve managed to create a parser that reads in an Elasticsearch query object and returns a mapping between the keywords and the hlt tag expected for each of them. This works equally well if you are using the query_string query within a boolean query. Just make sure you follow the rules we explored in the previous section (and make sure to declare a single default_field per query string argument). If you are struggling with this problem, please get in touch. I have some code that might help. For more sophisticated queries, including ones with fields unrelated to the highlighting, I have found that using the highlighter’s highlight_query parameter avoids this potential ambiguity.
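As a rough illustration of the highlight_query idea: the search query can carry whatever filters it needs, while the highlighter is handed only the text clauses whose tag order we want to reason about. The helper name build_body_with_highlight_query and the status filter are hypothetical and only there to make the example concrete.

```python
import json

def build_body_with_highlight_query(text_query, filters, field="text"):
    """Search with text_query plus filters, but highlight using text_query alone."""
    return {
        "query": {"bool": {"must": [text_query], "filter": filters}},
        "highlight": {
            "tags_schema": "styled",
            "fields": {
                field: {
                    "type": "fvh",
                    # Only these clauses are used when working out the highlights.
                    "highlight_query": text_query,
                }
            },
        },
    }

text_query = {
    "query_string": {"query": "fabulous new technologies emerge", "default_field": "text"}
}
filters = [{"term": {"status": "published"}}]  # hypothetical unrelated field
print(json.dumps(build_body_with_highlight_query(text_query, filters), indent=2))
```

With this shape the filter clause never reaches the highlighter, so it cannot shift the numbering of the text keywords.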

The Caveat

Unfortunately all these nice rules go out the window if you use a wildcard within a query or query string. The query string New tech* will result in the following text:

I love technology. The various technologies of the 21st century.

being highlighted as so:

I love <hlt2>technology</hlt2>. The various <hlt3>technologies</hlt3> of the 21st century.

We see here that Elasticsearch has decided to give the word “technology” the tag hlt2 and the word “technologies” the tag hlt3. These are treated as separate highlight tokens. This may be all well and good when the wildcard is at the end of the query string (as any tag index of hlt2 or above should refer to the tech* token in our query string). However, when the wildcard appears in the middle of the query string (or in the middle of a boolean query), we have no way of knowing which tag belongs to which keyword. I am quite perplexed by this problem. If you have found a way around this please let me know!
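The one case that can still be resolved, a single wildcard as the last keyword, can be sketched as follows. This only encodes the observation above and is deliberately conservative: whenever a wildcard appears anywhere other than the last position, any tag at or after it is reported as ambiguous (None). The function name keyword_for_tag is made up for this example.

```python
def keyword_for_tag(tag_index, keywords):
    """Resolve a 1-based hlt index to a keyword, or None when it is ambiguous."""
    if tag_index < 1:
        return None
    wildcard_positions = [i for i, kw in enumerate(keywords, start=1) if "*" in kw]
    if not wildcard_positions:
        return keywords[tag_index - 1] if tag_index <= len(keywords) else None
    first_wildcard = wildcard_positions[0]
    if tag_index < first_wildcard:
        return keywords[tag_index - 1]  # keywords before the wildcard keep their index
    if first_wildcard == len(keywords):
        return keywords[-1]  # trailing wildcard absorbs every higher index
    return None  # wildcard in the middle: cannot tell which keyword matched

keywords = ["New", "tech*"]
for tag in (1, 2, 3):
    print(tag, keyword_for_tag(tag, keywords))
```

For New tech* this resolves hlt1 to New and both hlt2 and hlt3 to tech*, while a query such as tech* new returns None for every tag.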

Future investigations

In a future investigation I will dive into the open source repo to see how things are done under the hood. Perhaps I will also investigate how this can be accomplished using Lucene. I hope there is a more straightforward way of doing it. I will post again if I figure it out!

Edit: I got a suggestion to try out the annotated text highlighter to handle wildcard queries. I will try this out and report back.

We’re hiring!

If you find working on search an interesting problem, then you might find our company, QuantCopy, a great place to work. We are working on some really hard problems around information retrieval and QA. If you want to get involved please let me know! You can reach out to me at jack{at}quantcopy{dot}com. We are hiring engineers based within GMT ± 4 hours. We are opening an office in London in 2021.

Written by Jack Hodkinson
