I’ve been kicking the tires on Typesense, an open-source search engine and alternative to Elasticsearch, for a personal project. So far I’m quite impressed with it. This is a small post recording how easy it was to set up vector embedding fields using a custom model.
It is getting much easier to build semantic search applications. When I set up semantic search in previous work it was both exciting and frustrating, because the tooling wasn't there and you had to build basically every piece yourself. It was cutting-edge, but you had to make your own cutting tools before you really saw the benefit. In my case, I had been using Elasticsearch and had to augment it with Annoy (using approximate k-nearest-neighbour search in place of exact cosine-distance scoring) until Elasticsearch added vector search support.
Using Custom Models
Typesense natively supports vector search, and integrates with OpenAI and OpenAI-compatible APIs; it can even load its own ONNX embedding models (it has a number of models it can use for this purpose). As I've been keeping up with embedding models, I've seen the work by Mixed Bread AI on their `mxbai-embed-large-v1` embedding model, a state-of-the-art model that outperforms OpenAI's `text-embedding-3-large`, and I was interested in using it.
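For comparison, the hosted-API route is just a matter of pointing an embedding field's `model_config` at OpenAI. Here's a sketch of what such a field looks like (the API key is a placeholder, and other `openai/` model names may work depending on your Typesense version):

```python
# Sketch: an embedding field backed by OpenAI's hosted API rather than a
# local ONNX model. The API key is a placeholder.
openai_embedding_field = {
    "name": "text_embedding",
    "type": "float[]",
    "embed": {
        "from": ["text"],
        "model_config": {
            "model_name": "openai/text-embedding-ada-002",
            "api_key": "sk-...",
        },
    },
}
```

Running a custom model locally takes a little more setup.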
Fortunately Typesense makes this easy. I'm running it in Docker, with a folder `data` for all related data. To set up Typesense to automatically create an embedding as it indexes using your custom model, first create the folder `data/models/mxbai-embed-large-v1`. Mixed Bread's model is available on HuggingFace; the ONNX format is required.

Here are the steps I took to configure and use the model. I downloaded the fp16 version of the model, renamed it to `model.onnx`, and put it at `data/models/mxbai-embed-large-v1/model.onnx`. I also downloaded the associated `vocab.txt` and put it in the same place. Lastly, create a `config.json` with the following content:
```json
{
  "model_type": "bert",
  "vocab_file_name": "vocab.txt"
}
```
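If you'd rather script those steps, here's a sketch using `huggingface_hub`; the file names inside the repo are an assumption on my part, so check the repo's file listing if they've moved:

```python
# Sketch: fetch the ONNX weights and vocab from HuggingFace and lay them
# out the way Typesense expects. The file names inside the repo are an
# assumption -- check the repo's file listing if they differ.
import json
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

repo_id = "mixedbread-ai/mxbai-embed-large-v1"
model_dir = Path("data/models/mxbai-embed-large-v1")
model_dir.mkdir(parents=True, exist_ok=True)

# fp16 ONNX weights, renamed to model.onnx as Typesense expects
onnx_path = hf_hub_download(repo_id, "onnx/model_fp16.onnx")
shutil.copy(onnx_path, model_dir / "model.onnx")

vocab_path = hf_hub_download(repo_id, "vocab.txt")
shutil.copy(vocab_path, model_dir / "vocab.txt")

config = {"model_type": "bert", "vocab_file_name": "vocab.txt"}
(model_dir / "config.json").write_text(json.dumps(config, indent=2))
```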
Schema Setup
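In the snippets that follow, `client` is a standard Typesense Python client; the host, port, and API key below are placeholders for however you've configured your instance:

```python
import typesense

# Placeholder connection details -- match these to your own instance.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    # Embedding at index time can be slow, so a generous timeout helps.
    "connection_timeout_seconds": 10,
})
```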
Now, when creating the schema for your collection, you can configure additional fields that will be embeddings of text fields:
```python
schema = {
    "name": "sites",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "author", "type": "string", "optional": True},
        {"name": "url", "type": "string"},
        {"name": "text", "type": "string", "optional": True},
        {
            "name": "text_embedding",
            "type": "float[]",
            "embed": {
                "from": ["text"],
                "model_config": {
                    "model_name": "mxbai-embed-large-v1",
                },
            },
        },
    ],
}

create_response = client.collections.create(schema)
```
All the work will be done for you at index time. This does slow indexing down (I've seen individual documents take up to 4 seconds), but working in bulk or parallelizing requests helps. Querying is fast and accurate, which is exactly the trade-off I'd want to see.
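To make that concrete, here's a sketch of bulk-importing documents and then running a semantic query against the embedding field. The documents are made up; the calls are the Python client's standard bulk import and search methods:

```python
# Bulk import: Typesense generates `text_embedding` for each document
# server-side as it indexes the `text` field.
docs = [
    {"title": "Typesense notes", "url": "https://example.com/typesense",
     "text": "Notes on setting up vector search with a custom embedding model."},
    {"title": "Sourdough starter", "url": "https://example.com/sourdough",
     "text": "Keeping a rye starter alive through the winter."},
]
client.collections["sites"].documents.import_(docs, {"action": "create"})

# Semantic search: querying by the embedding field runs a nearest-neighbour
# search; exclude the raw vector from the response to keep it small.
results = client.collections["sites"].documents.search({
    "q": "how do I add embeddings to a search index",
    "query_by": "text_embedding",
    "exclude_fields": "text_embedding",
})
for hit in results["hits"]:
    print(hit["document"]["title"])
```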
Typesense is designed without any sort of UI, so I've been using typesense-ui to browse records, pull random ids for testing, and so on, and it has worked well so far. Overall I've been very impressed.