I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model is then used to encode the part to be recommended as a vector, which is written out to a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the nearest string to the input in vector space.

My dataset consisted of about 10,000 input-output pairs. I split it into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, which maps strings into a dense 384-dimensional space. Sentence Transformers models use Contrastive Learning, which means I need both positive and negative pairs to train them, but my training set consists of positive pairs only, by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRanking (MNR) loss, which takes positive pairs and generates negative pairs by sampling from within each batch. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity over a validation set).
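The in-batch negatives trick behind MNR loss can be sketched in plain NumPy. This is a toy illustration with made-up names, not the sentence-transformers implementation: for each anchor in a batch, its paired output is the positive, and every other output in the same batch serves as a negative in a softmax cross-entropy over cosine similarities.

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Toy in-batch-negatives loss: for row i, positives[i] is the true
    match and every other row in the batch acts as a negative."""
    # Normalize rows so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)  # (batch, batch) scaled similarity matrix
    # Softmax cross-entropy with the diagonal as the target class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(42)
batch = rng.normal(size=(8, 384))
# Perfectly aligned pairs give a loss near 0; shuffled pairs a large loss
print(mnr_loss(batch, batch), mnr_loss(batch, batch[::-1]))
```

Minimizing this pulls each anchor toward its own positive and pushes it away from the rest of the batch, which is why no explicit negative pairs are needed.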

To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall@k and MRR (Mean Reciprocal Rank) @k for `k = 1, 3, 5, 10, 20, 50, 100`. For recall, I score a match @k as successful if the output of the pair appears within the top k nearest neighbors returned from the vector search. This score is then averaged across all 1,000 test set pairs for each value of k. MRR is similar, except that the score for each test set pair is the reciprocal of the (1-based) position at which the output matched (and 0 if it did not match within the top k).
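The two per-query metrics described above fit in a few lines; a minimal sketch (function and variable names are my own):

```python
def recall_at_k(ranked: list, expected: str, k: int) -> float:
    """1.0 if the expected output appears among the top-k neighbors, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0

def mrr_at_k(ranked: list, expected: str, k: int) -> float:
    """Reciprocal of the 1-based rank of the expected output within the
    top-k neighbors, or 0.0 if it is not found."""
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate == expected:
            return 1.0 / rank
    return 0.0

neighbors = ["b", "a", "c"]            # hypothetical ranked results for one query
print(recall_at_k(neighbors, "a", 1))  # 0.0: not in the top 1
print(mrr_at_k(neighbors, "a", 3))     # 0.5: found at rank 2
```

Averaging each function over the test set queries gives the Recall@k and MRR@k numbers for a given k.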

The baseline for the experiment was computed using an index of encodings of the input part created using the stock `all-MiniLM-L6-v2` model.

I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, parts one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for `float` (float32), `byte` (int8) and `binary` (bool) storage types. I was curious to see (a) whether I could replace my float-based vectors with these, and (b) if so, how their performance would compare. That is what this post is about.

The index mappings for the float, byte and binary vector fields are as follows. These need to be set at index creation time.

**Float Vector**

```json
{
  "type": "knn_vector",
  "dimension": 384,
  "method": {
    "name": "hnsw",
    "engine": "lucene",
    "space_type": "cosinesimil",
    "parameters": {
      "ef_construction": 128,
      "m": 24
    }
  }
}
```

**Byte vector**

```json
{
  "type": "knn_vector",
  "dimension": 384,
  "data_type": "byte",
  "method": {
    "name": "hnsw",
    "engine": "lucene",
    "space_type": "cosinesimil",
    "parameters": {
      "ef_construction": 128,
      "m": 24
    }
  }
}
```

**Binary vector**

```json
{
  "type": "binary",
  "doc_values": "true"
}
```

To generate the vectors, I used the fine-tuned version of the `all-MiniLM-L6-v2` model as my encoder, and post-processed the float32 vectors returned from the encoder into int8 and binary using the following functions.

```python
import binascii

import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # Scale float components (roughly in [-1, 1)) into the int8 range
    return np.floor(arr * 128).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # One sign bit per dimension, packed 8 bits to a byte
    bits = np.packbits(np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        # Hex-encode each packed row for the OpenSearch binary field
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits
```
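To make the binary transform concrete, here is the same packbits-plus-hexlify pipeline applied to a toy 8-dimensional vector (a 384-dimensional embedding would pack into 48 bytes, i.e. a 96-character hex string):

```python
import binascii
import numpy as np

# One 8-dimensional vector; each dimension contributes one sign bit
vec = np.array([[0.3, -0.1, 0.7, 0.0, -0.5, 0.2, 0.9, -0.2]])
bits = np.where(vec > 0, 1, 0)  # -> [[1, 0, 1, 0, 0, 1, 1, 0]]
packed = np.packbits(bits)      # 8 bits -> one byte: 0b10100110
hex_str = str(binascii.hexlify(packed), "utf-8")
print(hex_str)  # "a6"
```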

At the end of this indexing process, I had three separate indexes of approximately 10,000 records, one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because there are fields other than the vectors for each index), the sizes reported by `/_cat/indices` are shown below.

Vector Storage Type | Index Size
---|---
float | 184.9 MB
byte | 39.6 MB
binary | 15.6 MB
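The raw vector payloads alone account for only part of these sizes (the rest is other stored fields and, for the knn fields, the HNSW graph), but back-of-the-envelope arithmetic shows why the ordering comes out this way:

```python
n_docs, dim = 10_000, 384
float_mb = n_docs * dim * 4 / 1e6  # float32: 4 bytes per dimension
byte_mb = n_docs * dim * 1 / 1e6   # int8: 1 byte per dimension
bin_mb = n_docs * dim / 8 / 1e6    # binary: 1 bit per dimension
print(float_mb, byte_mb, bin_mb)   # 15.36 3.84 0.48
```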

On the query side, I use the following Script Score queries as described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation pages. The queries are all score scripts as shown below. The queries for float and byte vectors are the same; the only difference is that the float vector is quantized down to int8 in the byte case.

**Float and byte vector**

```json
{
  "script_score": {
    "query": {
      "match_all": {}
    },
    "script": {
      "source": "knn_score",
      "lang": "knn",
      "params": {
        "field": "{field_name}",
        "query_value": "{float_or_byte_vector}",
        "space_type": "cosinesimil"
      }
    }
  }
}
```

**Binary Vector**

```json
{
  "script_score": {
    "query": {
      "match_all": {}
    },
    "script": {
      "source": "knn_score",
      "lang": "knn",
      "params": {
        "field": "{field_name}",
        "query_value": "{binary_vector}",
        "space_type": "hammingbit"
      }
    }
  }
}
```
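The `hammingbit` space type ranks documents by Hamming distance over the packed bits, so fewer differing bits means a closer match. A minimal sketch of that distance, assuming hex-encoded vectors like the ones produced at indexing time:

```python
import binascii
import numpy as np

def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two hex-encoded binary vectors."""
    a = np.frombuffer(binascii.unhexlify(hex_a), dtype=np.uint8)
    b = np.frombuffer(binascii.unhexlify(hex_b), dtype=np.uint8)
    # XOR highlights differing bits; unpack and count them
    return int(np.unpackbits(a ^ b).sum())

print(hamming_distance("a6", "a7"))  # last bit differs -> 1
print(hamming_distance("ff", "00"))  # all 8 bits differ -> 8
```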

The charts below show Recall@k and MRR@k for the values of k described above.

The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with is more nuanced than plain cosine similarity. Comparing float vectors against byte vectors, float vectors have a slight edge, as you can see from the table below (if you look hard you can also see the float vector (orange) and byte vector (green) lines almost overlaid on each other in the charts). While binary vectors don't perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.

**Mean Reciprocal Rank**

**Recall**

In terms of response time, binary vectors are the fastest, followed by byte vectors, followed by float vectors. I measured the response time for each of the 1,000 test set queries to extract 100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. Here are the results.

Vector Storage Type | Mean Response Time | Standard Deviation
---|---|---
float | 0.382 s | 0.097 s
byte | 0.231 s | 0.050 s
binary | 0.176 s | 0.048 s

I originally brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my "non-work colleagues" on the channel who helped me out by validating my observations and acting as sounding boards, which eventually helped me find the mistakes in my analysis. In any case, I figured that the final results may be useful to the broader community of vector search enthusiasts and professionals, so I am sharing this. Hope you find it useful.
