Most search models in use today are based on lexical similarity, or how many important words overlap. A more accurate approach is based on semantic similarity, or how much abstract meaning overlaps.
Semantic similarity is built on transformer models, a type of deep learning model that creates an embedding representing each document’s semantic meaning.
Lexical similarity is great if you know the exact keywords you are searching for, but brittle if you do not. Take two examples:
- Obama speaks to the media in Illinois.
- The President greets the press in Chicago.
For anyone familiar with US politics, the first example says essentially the same thing as the second – in other words, they are semantically (conceptually) similar. However, the important words in the two examples don’t match up, so a lexical similarity measure won’t recognize them as related.
A semantic similarity approach matches up the words Obama and President, media and press, Illinois and Chicago, and speaks and greets.
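To see the gap concretely, here’s a minimal sketch comparing the two measures on those sentences (the post doesn’t name an encoder, so the sentence-transformers library and the all-MiniLM-L6-v2 model below are stand-ins):

```python
from sentence_transformers import SentenceTransformer, util

sent_a = "Obama speaks to the media in Illinois."
sent_b = "The President greets the press in Chicago."

# Lexical similarity: Jaccard overlap of the lowercased word sets.
# The only shared words are stop words ("the", "in"), so the score is low.
words_a = set(sent_a.lower().rstrip(".").split())
words_b = set(sent_b.lower().rstrip(".").split())
print(len(words_a & words_b) / len(words_a | words_b))  # ~0.18

# Semantic similarity: cosine similarity between sentence embeddings,
# which lands much higher because the meanings line up.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_a, emb_b = model.encode([sent_a, sent_b])
print(util.cos_sim(emb_a, emb_b).item())
```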
I hadn’t considered that those models can measure similarity between two words, but not between two sets of words – called documents in the parlance:
As it turns out, current state-of-the-art language models are good at measuring the similarity between two words, but not great at measuring the similarity between two documents. We had to perform a considerable amount of R&D work to develop a transformer model that could create document embeddings—we hope to go into the gory details of this work in future technical posts.
One essential trick was to use Word Mover’s Distance to create labels for pairs of documents in an unsupervised manner—so that our model could learn how to map a document’s word embeddings into a single document embedding. But, for now, the example above gives you the high-level idea behind our approach.
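The scoring half of that trick is easy to demo. Here’s a sketch using gensim’s wmdistance with pretrained GloVe vectors (both my choices; the post doesn’t say which embeddings or toolkit were used):

```python
import gensim.downloader as api

# Pretrained word vectors (a stand-in choice; any word embeddings work).
wv = api.load("glove-wiki-gigaword-100")

doc_a = "obama speaks to the media in illinois".split()
doc_b = "the president greets the press in chicago".split()
doc_c = "the band played a sold out show last night".split()

# Word Mover's Distance: the minimum cumulative distance the embedded
# words of one document must travel to reach the embedded words of the
# other. Lower means more similar.
print(wv.wmdistance(doc_a, doc_b))  # small: semantically close pair
print(wv.wmdistance(doc_a, doc_c))  # larger: unrelated topics
```

Scores like these, computed over many document pairs, could serve as the unsupervised training labels the quote describes: the transformer is then trained so that distances between its document embeddings reproduce them, with no hand annotation needed.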