Recently I built a system that uses vector search to logically truncate long documents and retain the most significant parts according to some search term. I'm a dummy, with no background in machine learning or mathematics, so there were new concepts for me to understand and implementation details to figure out. This post summarises what I learned.
Vector search is a way to find stuff by what it means, not what it says. Conventional search works by looking for similarity between text in a query and text in the search space. Vector search works by mapping text to an alternative representation that models similarity of meaning and then searching that alternative representation instead of the text itself. The alternative representation is called a vector.
For a very basic example, say you search for "dog". A document about spaniels that doesn't include the word "dog" would be found by vector search, but not by conventional search.
The relationships that vector search can identify are surprisingly powerful; questions can be matched to their answers, translations can be made between different languages, cultural references can be interpreted.
A vector is an array of floating point numbers, representing a position in n-dimensional space. For example, if you're programming a 2d game, the x and y co-ordinates form a 2-dimensional vector, whereas in a 3d game the x, y and z co-ordinates form a 3-dimensional vector.
The vectors that you'll encounter in vector search have many more than 2 or 3 dimensions, but the principle is the same. The number of elements in a vector is its dimensionality, so a vector of length 512 is said to have 512 dimensions.
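To make that concrete, here's a tiny sketch in Python (my choice of language purely for illustration; the values are made up):

```python
# A 2-dimensional vector, like an (x, y) position in a 2d game.
position = [3.0, 4.0]

# Embedding models produce much longer vectors; 512 dimensions is a common size.
# This one is just zeroes, to show the shape of the data.
embedding = [0.0] * 512

print(len(position))   # 2 dimensions
print(len(embedding))  # 512 dimensions
```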
Consider the 2-dimensional x and y co-ordinates mentioned above. We can measure how far an object is from a given position using those co-ordinates and some basic geometry you probably remember from school. Pythagoras' theorem, a² + b² = c², establishes that you can calculate the distance between two points on a plane by summing the squares of their distances along orthogonal axes, x and y in this case, then taking the square root. The same formula can be applied to vectors with greater dimensions and this is one way to measure vector similarity, called Euclidean distance.
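As a rough illustration of how that generalises, here's a small Python sketch; nothing in it comes from any particular library, `euclidean_distance` is just a name I've made up:

```python
from math import sqrt

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Distance between two points, generalised to any number of dimensions."""
    assert len(a) == len(b), "vectors must have the same dimensionality"
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2 dimensions: the classic 3-4-5 right triangle.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0

# The same function works unchanged on higher-dimensional vectors.
print(euclidean_distance([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))  # 2.0
```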
Fundamentally, search is the process of finding items that are closest to a query. By mapping text to vectors so that similar meanings or ideas produce similar vectors, we turn semantic search into a mathematical problem.
At this point, you might be thinking that reliably generating vectors for any text sounds like the hardest part of all this. And you'd be right, but fortunately it's a solved problem.
Vectors can be generated using a large language model (LLM), the same technology that powers AI assistants like ChatGPT. LLMs are able to do this because vectors are a data structure they use internally to model language. In LLM parlance, vectors are often referred to as embeddings. I'll use those two terms interchangeably for the remainder of this post.
Of course, the models that are used for chat completion are not the same ones used to generate embeddings. Instead there are specific models built especially for the purpose of turning text into vectors. There are many examples of these models out there, with different tradeoffs relating to size and speed. I recommend building an architecture that lets you plug in different models easily, so you can play with various options and get a sense of what's best for your application.
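One way to get that flexibility is to make the rest of your code depend on a small interface. The sketch below is hypothetical, not any real library's API; a real implementation would wrap whichever model you choose behind the same shape:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """The only surface the rest of the system depends on."""
    dimensions: int

    def embed(self, text: str) -> list[float]:
        """Map a piece of text to a vector with `dimensions` elements."""
        ...

class FakeEmbeddingModel:
    """A deterministic stand-in for tests; swap in a real model behind the same interface."""
    dimensions = 8

    def embed(self, text: str) -> list[float]:
        # Not a real embedding, just something with the right shape.
        return [float((hash(text) >> i) % 100) / 100 for i in range(self.dimensions)]
```

Swapping models then just means writing another class that satisfies the same interface.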
Each model will always generate the same dimensionality of vector, regardless of input text size. This means a large document gets squeezed into the same amount of information as a small one, losing detail along the way, so you will want to break large documents into chunks before sending them to your embedding model. The size of those chunks directly affects the accuracy and precision of your search implementation. There are no hard and fast rules concerning the "best" chunk size, so experiment aggressively within the context of your application. I've seen it suggested elsewhere that a reasonable starting point is 4kb chunks, with a 10% overlap to preserve context around breakpoints. Depending on your application, you may also want to track the origin of each chunk within its parent document; e.g. chapters in a book or sections in a page.
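Here's a minimal sketch of that kind of chunking, treating 4kb as 4096 characters for simplicity; a real implementation might prefer to split on sentence or paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 4096, overlap: float = 0.1) -> list[dict]:
    """Split text into fixed-size chunks, each overlapping the previous one slightly.

    The starting offset is kept alongside each chunk so search results can be
    traced back to their position in the parent document.
    """
    step = int(chunk_size * (1 - overlap))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"offset": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break
    return chunks
```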
For vector search to work, all of the documents from your search space must be converted to vectors and added to an index. With that in place, you can convert each search query to a vector and then find the items closest to it.
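In code, that flow might look something like the sketch below. Everything here is hypothetical: `embed` stands in for whichever model you've plugged in, the "index" is just a list in memory, and I'm using Euclidean distance as the similarity measure for now:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; always returns 8 floats.
    return [float(ord(character)) for character in text[:8].ljust(8)]

index: list[tuple[str, list[float]]] = []

def add_document(doc_id: str, text: str) -> None:
    # Indexing: convert each document (or chunk) to a vector and store both.
    index.append((doc_id, embed(text)))

def search(query: str, k: int = 3) -> list[str]:
    # Querying: convert the query to a vector, then rank stored vectors by distance.
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: dist(item[1], query_vector))
    return [doc_id for doc_id, _ in ranked[:k]]
```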
Indexing and search are typically implemented by a data store or search engine, although there are in-memory implementations available too. Picking the right implementation depends entirely on your usage. If you're working with ephemeral data of moderate size, an in-memory implementation might suit you best. If you need persistence and already have a data store in your backend, it may support vector search either in its core or via a plugin. There are also dedicated vector databases you can use.
I'm deliberately not identifying any specific data stores or libraries in this post, for two reasons:
1. I'm not an expert and there are far more qualified opinions than mine that can offer good advice.
2. The state of the art is evolving quite rapidly and whatever I write here will likely become outdated very soon.
That said, in my experience most implementations support multiple different algorithms and have many options you can tweak to configure performance. Picking the right implementation is probably more about fitting nicely with your existing infrastructure than it is about agonising over the "right" one.
There are two main concepts at play when it comes to searching vectors.
The first one is distance, which I discussed a bit earlier. I mentioned Euclidean distance, which is the simplest way to measure vector similarity but apparently not the best at finding similar text. For that, cosine similarity works better, although to be honest my limited aptitude for maths prevents me from explaining it properly. Roughly, it measures the angle between two vectors rather than the straight-line distance between their endpoints, so vectors pointing in the same direction count as similar regardless of their magnitude. 🤷
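For what it's worth, here's what the calculation looks like in plain Python; a real implementation would guard against zero-length vectors and use an optimised linear algebra library instead:

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = sqrt(sum(x * x for x in a))
    magnitude_b = sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

print(cosine_similarity([1.0, 0.0], [10.0, 0.0]))  # 1.0: same direction, different magnitude
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: perpendicular
```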
The other main concept to consider is how to identify nearest neighbours. In other words, given a search space populated with many vectors, how do you find the closest ones according to your distance metric?
Of course, you can just iterate through the entire space and collect nearest neighbours as you go. This is the brute force approach and works well for small datasets, or if indexing latency is more important than search latency. When there are many documents to search though, it may take too long.
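A brute force search can be as simple as the sketch below; the names are mine, and I'm using Euclidean distance to keep it short, but cosine similarity would slot in just as easily:

```python
import heapq
from math import dist  # Euclidean distance (Python 3.8+)

def brute_force_search(
    index: list[tuple[str, list[float]]],  # (document id, vector) pairs
    query_vector: list[float],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Scan every vector in the index, keeping the k closest to the query."""
    return heapq.nsmallest(
        k,
        ((dist(vector, query_vector), doc_id) for doc_id, vector in index),
    )

index = [
    ("doc-1", [0.0, 1.0]),
    ("doc-2", [5.0, 5.0]),
    ("doc-3", [0.2, 0.9]),
]
print(brute_force_search(index, [0.0, 1.0], k=2))  # doc-1 and doc-3 are closest
```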
Alternatively there's a family of algorithms collectively referred to as approximate nearest neighbour (ANN). Some of these involve constructing hash tables to optimise the process, others divide the search space into subgraphs and collate the results from traversing each one, still others follow an iterative approach of gradually partitioning the data until nearest neighbours are found, and there are hybrid algorithms that combine elements of all three. The cost of these ANN approaches is slower indexing and the payoff is much faster search operations. One of the fastest is called hierarchical navigable small world (HNSW), which is graph-based.
The suggestion I made about embedding models applies here too: experiment with different algorithms using data that is representative of real-world workloads for your application. This stuff is easy to measure and those measurements will be more informative than anything a random blog post from the internet can tell you.