Recently I built a system that uses vector search to logically truncate long documents and retain the most significant parts according to some search term. I'm a dummy, with no background in machine learning or mathematics, so there were new concepts for me to understand and implementation details to figure out. This post summarises what I learned.
Vector search is a way to find stuff by what it means, not what it says. Conventional search works by looking for similarity between text in a query and text in the search space. Vector search works by mapping text to an alternative representation that models similarity of meaning and then searching that alternative representation instead of the text itself. The alternative representation is called a vector.
For a very basic example, say you search for "dog". A document about spaniels that doesn't include the word "dog" would be found by vector search, but not by conventional search.
The relationships that vector search can identify are surprisingly powerful; questions can be matched to their answers, translations can be made between different languages, cultural references can be interpreted.
A vector is an array of floating point numbers, representing a position in n-dimensional space. For example, if you're programming a 2d game, the x and y co-ordinates form a 2-dimensional vector, whereas in a 3d game the x, y and z co-ordinates form a 3-dimensional vector.
The vectors that you'll encounter in vector search have many more than 2 or 3 dimensions, but the principle is the same. The number of elements in a vector is its dimensionality, so a vector of length 512 is said to have 512 dimensions.
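To make that concrete, here's a tiny sketch in Python (my choice of language purely for illustration; the values are made up):

```python
# A 2-dimensional vector, like an (x, y) position in a 2d game.
position = [3.0, 4.0]

# Embedding models produce much longer vectors; 512 dimensions is a common size.
# This one is just zeroes, to show the shape of the data.
embedding = [0.0] * 512

print(len(position))   # 2 dimensions
print(len(embedding))  # 512 dimensions
```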
Consider the 2-dimensional x and y co-ordinates mentioned above. We can measure how far an object is from a given position using those co-ordinates and some basic geometry you probably remember from school. Pythagoras' theorem, a² + b² = c², establishes that you can calculate the distance between two points on a plane by summing the squares of their distances along orthogonal axes, x and y in this case, then taking the square root. The same formula can be applied to vectors with greater dimensions and this is one way to measure vector similarity, called Euclidean distance.
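As a rough illustration of how that generalises, here's a small Python sketch; nothing in it comes from any particular library, `euclidean_distance` is just a name I've made up:

```python
from math import sqrt

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Distance between two points, generalised to any number of dimensions."""
    assert len(a) == len(b), "vectors must have the same dimensionality"
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2 dimensions: the classic 3-4-5 right triangle.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0

# The same function works unchanged on higher-dimensional vectors.
print(euclidean_distance([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))  # 2.0
```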
Fundamentally, search is the process of finding items that are closest to a query. By mapping text to vectors so that similar meanings or ideas produce similar vectors, we turn semantic search into a mathematical problem.
At this point, you might be thinking that reliably generating vectors for any text sounds like the hardest part of all this. And you'd be right, but fortunately it's a solved problem.
Vectors can be generated using a large language model (LLM), the same technology that powers AI assistants like ChatGPT. LLMs are able to do this because vectors are a data structure they use internally to model language. In LLM parlance, vectors are often referred to as embeddings. I'll use those two terms interchangeably for the remainder of this post.
Of course, the models that are used for chat completion are not the same ones used to generate embeddings. Instead there are specific models built especially for the purpose of turning text into vectors. There are many examples of these models out there, with different tradeoffs relating to size and speed. I recommend building an architecture that lets you plug in different models easily, so you can play with various options and get a sense of what's best for your application.
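One way to get that flexibility is to make the rest of your code depend on a small interface. The sketch below is hypothetical, not any real library's API; a real implementation would wrap whichever model you choose behind the same shape:

```python
from typing import Protocol

class EmbeddingModel(Protocol):
    """The only surface the rest of the system depends on."""
    dimensions: int

    def embed(self, text: str) -> list[float]:
        """Map a piece of text to a vector with `dimensions` elements."""
        ...

class FakeEmbeddingModel:
    """A deterministic stand-in for tests; swap in a real model behind the same interface."""
    dimensions = 8

    def embed(self, text: str) -> list[float]:
        # Not a real embedding, just something with the right shape.
        return [float((hash(text) >> i) % 100) / 100 for i in range(self.dimensions)]
```

Swapping models then just means writing another class that satisfies the same interface.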
Each model will always generate the same dimensionality of vector, regardless of input text size. This means a large document gets squeezed into the same amount of information as a small one, losing detail along the way, so you will want to break large documents into chunks before sending them to your embedding model. The size of those chunks directly affects the accuracy and precision of your search implementation. There are no hard and fast rules concerning the "best" chunk size, so experiment aggressively within the context of your application. I've seen it suggested elsewhere that a reasonable starting point is 4kb chunks, with a 10% overlap to preserve context around breakpoints. Depending on your application, you may also want to track the origin of each chunk within its parent document; e.g. chapters in a book or sections in a page.
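Here's a minimal sketch of that kind of chunking, treating 4kb as 4096 characters for simplicity; a real implementation might prefer to split on sentence or paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 4096, overlap: float = 0.1) -> list[dict]:
    """Split text into fixed-size chunks, each overlapping the previous one slightly.

    The starting offset is kept alongside each chunk so search results can be
    traced back to their position in the parent document.
    """
    step = int(chunk_size * (1 - overlap))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"offset": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break
    return chunks
```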
For vector search to work, all of the documents from your search space must be converted to vectors and added to an index. With that in place, you can convert each search query to a vector and then find the items closest to it.
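In code, that flow might look something like the sketch below. Everything here is hypothetical: `embed` stands in for whichever model you've plugged in, the "index" is just a list in memory, and I'm using Euclidean distance as the similarity measure for now:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; always returns 8 floats.
    return [float(ord(character)) for character in text[:8].ljust(8)]

index: list[tuple[str, list[float]]] = []

def add_document(doc_id: str, text: str) -> None:
    # Indexing: convert each document (or chunk) to a vector and store both.
    index.append((doc_id, embed(text)))

def search(query: str, k: int = 3) -> list[str]:
    # Querying: convert the query to a vector, then rank stored vectors by distance.
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: dist(item[1], query_vector))
    return [doc_id for doc_id, _ in ranked[:k]]
```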
Indexing and search are typically implemented by a data store or search engine, although there are in-memory implementations available too. Picking the right implementation depends entirely on your usage. If you're working with ephemeral data of moderate size, an in-memory implementation might suit you best. If you need persistence and already have a data store in your backend, it may support vector search either in its core or via a plugin. There are also dedicated vector databases you can use.
I'm deliberately not identifying any specific data stores or libraries in this post, for two reasons:
1. I'm not an expert and there are far more qualified opinions than mine that can offer good advice.
2. The state of the art is evolving quite rapidly and whatever I write here will likely become outdated very soon.
That said, in my experience most implementations support multiple different algorithms and have many options you can tweak to configure performance. Picking the right implementation is probably more about fitting nicely with your existing infrastructure than it is about agonising over the "right" one.
There are two main concepts at play when it comes to searching vectors.
The first one is distance, which I discussed a bit earlier. I mentioned Euclidean distance, which is the simplest way to measure vector similarity but apparently not the best at finding similar text. For that, cosine similarity works better, although to be honest my limited aptitude for maths prevents me from explaining it properly. Roughly, it measures the angle between two vectors rather than the straight-line distance between their endpoints, so vectors pointing in the same direction count as similar regardless of their magnitude. 🤷
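For what it's worth, here's what the calculation looks like in plain Python; a real implementation would guard against zero-length vectors and use an optimised linear algebra library instead:

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = sqrt(sum(x * x for x in a))
    magnitude_b = sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

print(cosine_similarity([1.0, 0.0], [10.0, 0.0]))  # 1.0: same direction, different magnitude
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: perpendicular
```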
The other main concept to consider is how to identify nearest neighbours. In other words, given a search space populated with many vectors, how do you find the closest ones according to your distance metric?
Of course, you can just iterate through the entire space and collect nearest neighbours as you go. This is the brute force approach and works well for small datasets, or if indexing latency is more important than search latency. When there are many documents to search though, it may take too long.
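A brute force search can be as simple as the sketch below; the names are mine, and I'm using Euclidean distance to keep it short, but cosine similarity would slot in just as easily:

```python
import heapq
from math import dist  # Euclidean distance (Python 3.8+)

def brute_force_search(
    index: list[tuple[str, list[float]]],  # (document id, vector) pairs
    query_vector: list[float],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Scan every vector in the index, keeping the k closest to the query."""
    return heapq.nsmallest(
        k,
        ((dist(vector, query_vector), doc_id) for doc_id, vector in index),
    )

index = [
    ("doc-1", [0.0, 1.0]),
    ("doc-2", [5.0, 5.0]),
    ("doc-3", [0.2, 0.9]),
]
print(brute_force_search(index, [0.0, 1.0], k=2))  # doc-1 and doc-3 are closest
```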
Alternatively there's a family of algorithms collectively referred to as approximate nearest neighbour (ANN). Some of these involve constructing hash tables to optimise the process, others divide the search space into subgraphs and collate the results from traversing each one, still others follow an iterative approach of gradually partitioning the data until nearest neighbours are found, and there are hybrid algorithms that combine elements of all three. The cost of these ANN approaches is slower indexing and the payoff is much faster search operations. One of the fastest is called hierarchical navigable small world (HNSW), which is graph-based.
The suggestion I made about embedding models applies here too: experiment with different algorithms using data that is representative of real-world workloads for your application. This stuff is easy to measure and those measurements will be more informative than anything a random blog post from the internet can tell you.