Chapter 6: Performing Similarity Search Directly in ScyllaDB

Introduction

Welcome back, future vector search expert! In previous chapters, we explored the standalone power of USearch, learned how to create and query vector indexes, and understood the fundamental concepts behind vector embeddings. Now, it’s time to bring that power directly into your database.

This chapter is all about integrating vector search capabilities directly into ScyllaDB, a high-performance, real-time NoSQL database. ScyllaDB has embraced the growing need for AI-native applications by offering native vector search, leveraging USearch under the hood for its efficient Approximate Nearest Neighbor (ANN) indexing. This means you can store your data and its associated vector embeddings together and perform similarity queries without needing a separate vector database or complex synchronization. Pretty neat, right?

By the end of this chapter, you’ll understand how ScyllaDB’s vector search works, how to set it up, and how to perform blazing-fast similarity searches using simple CQL (Cassandra Query Language) commands. We’ll focus on practical, hands-on steps, ensuring you build a solid understanding.

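To get the most out of this chapter, you should have a running ScyllaDB deployment that includes vector search and a basic grasp of CQL. Vector search became generally available in January 2026, so older releases such as the 5.x line do not include it; check the official documentation for the minimum version and for any additional services the feature needs in your deployment. If you need to set up ScyllaDB, refer to its official documentation.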

Core Concepts: ScyllaDB’s Approach to Vector Search

ScyllaDB’s native vector search feature is a game-changer for real-time AI applications. Instead of exporting your data to a separate vector database, you can keep everything in one place, simplifying your architecture and reducing latency. Let’s break down the key components.

The `vector` Data Type

At the heart of ScyllaDB’s vector search is a new native data type: vector. This type allows you to store high-dimensional numerical vectors directly within your tables.

Think of it like this: just as you have int for whole numbers or text for strings, vector is specifically designed for numerical arrays that represent embeddings.

- What it is: A vector<float, N> type stores a fixed-size array of floating-point numbers, where N is the dimension of your vectors.
- Why it’s important: It gives embeddings a native column type, so the database can reject vectors of the wrong length and an index can be built directly on the column.
- How it functions: When you define a column as vector<float, 1536> (a common dimension for many embedding models), every value written to that column must contain exactly 1536 floats.
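To get a feel for what a fixed-size vector column costs, here is a minimal Python sketch that estimates the raw payload size. It assumes each component is a 32-bit float (4 bytes), as the CQL float type is, and it ignores per-row, replication, and index overhead. The function name is illustrative, not part of any ScyllaDB API.

```python
FLOAT_BYTES = 4  # a CQL float is a 32-bit IEEE 754 value


def raw_vector_bytes(dimension: int, rows: int = 1) -> int:
    """Raw payload size of `rows` values of type vector<float, dimension>."""
    if dimension <= 0 or rows < 0:
        raise ValueError("dimension must be positive and rows non-negative")
    return dimension * FLOAT_BYTES * rows


per_vector = raw_vector_bytes(1536)
per_million = raw_vector_bytes(1536, rows=1_000_000)
print(f"{per_vector} bytes per vector")                     # 6144 bytes per vector
print(f"{per_million / 1e9:.2f} GB per million vectors")    # 6.14 GB per million vectors
```

Numbers like these are why the dimension of your embedding model matters for capacity planning, even before the index is taken into account.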

Vector Indexing with `CREATE CUSTOM INDEX`

Storing vectors is one thing; searching them efficiently is another. ScyllaDB integrates USearch to provide Approximate Nearest Neighbor (ANN) indexing directly on your vector columns. This is achieved using the CREATE CUSTOM INDEX statement.

- What it is: A custom index built on a vector column that enables fast similarity searches. Behind the scenes, ScyllaDB uses the USearch library to construct and manage this index.
- Why it’s important: Without an index, finding similar vectors means comparing the query against every vector in the table, so the cost grows linearly with the number of rows. The index keeps lookups fast even across millions or billions of vectors.
- How it functions: When you create a vector index, ScyllaDB builds an ANN index (HNSW, which USearch implements) on that column. The index organizes your vectors as a graph that can be traversed to narrow the search to a small set of candidates, returning approximate rather than guaranteed-exact nearest neighbors.
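To make the cost of an unindexed search concrete, the following Python sketch performs the exact, brute-force scan that an ANN index is designed to avoid. It is an illustration only: the function names and sample rows are made up, and it does not reflect how ScyllaDB or USearch are implemented.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, rows, k):
    """Score every (key, vector) pair and return the k most similar keys."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in rows)
    return [key for _, key in heapq.nlargest(k, scored)]


rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.9, 0.1]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
print(exact_top_k([1.0, 0.05], rows, k=2))  # ['a', 'b']
```

Every query touches every row, so the work is proportional to the table size. An HNSW index trades a small amount of accuracy for visiting only a fraction of the vectors.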

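You can configure the index through WITH OPTIONS. Option names and accepted values have changed between releases, so treat the list below as a guide and confirm it against the documentation for your version:

similarity_function: Determines how “similarity” is measured. The documented values are COSINE (cosine similarity), EUCLIDEAN (Euclidean, or L2, distance), and DOT_PRODUCT (inner product); shorthand such as L2 or IP may not be accepted.
Index algorithm: The index is an HNSW (Hierarchical Navigable Small World) graph built by USearch, which is known for a good balance of speed and accuracy. The algorithm is not something you normally select; what you tune are HNSW parameters such as the number of connections per node.
quantization: Where supported, an optional optimization that reduces memory footprint by storing vectors in a compressed format (e.g., INT8, BINARY). This comes with a trade-off in accuracy.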
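The accuracy trade-off of quantization is easier to see with numbers. The sketch below applies a simple per-vector scalar quantization to 8-bit integers and measures the rounding error. This scheme is an assumption chosen for illustration; it is not necessarily the one ScyllaDB or USearch uses.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale factor per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


original = [0.1, 0.2, 0.9]
codes, scale = quantize_int8(original)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(original, restored))

print(codes)  # [14, 28, 127]
print(f"largest rounding error: {max_error:.4f}")
```

Each component now fits in one byte instead of four, at the price of a rounding error of at most half a quantization step per component. Those small errors are what can reorder near-ties in the search results.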

Performing Similarity Search with `ANN OF`

Once you have a vector column and an index, querying for similar items is straightforward using the ANN OF operator in an ORDER BY clause.

- What it is: The ANN OF operator is the CQL syntax for triggering an Approximate Nearest Neighbor search on an indexed vector column.
- Why it’s important: It tells ScyllaDB to use the vector index, rather than a full scan, to find the vectors most similar to your query vector.
- How it functions: You provide a query vector, and ScyllaDB, using the underlying USearch index, returns the k (specified by LIMIT) closest vectors from your table, ordered by similarity.

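The overall flow looks like this: your application sends text (or another input) to an embedding service, receives a vector back, and writes that vector to ScyllaDB alongside the row it describes. At query time, the application embeds the search input the same way and passes the resulting vector to an ANN OF query, which returns the closest rows.

Note: The EmbeddingService is typically external to ScyllaDB, generating the vectors you then store and query.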

Step-by-Step Implementation

Let’s get our hands dirty and implement vector search in ScyllaDB. For this example, we’ll imagine a simple movie recommendation system where we store movie titles and their vector embeddings.

Prerequisites: Ensure your ScyllaDB instance is running and that its release includes vector search (check the official documentation for the minimum version). You can connect to it using cqlsh, ScyllaDB’s command-line shell.

Connect to ScyllaDB

Open your terminal and connect to your ScyllaDB instance. If it’s running locally, the default is:
```
cqlsh
```
You should see the cqlsh> prompt.

Create a Keyspace

A keyspace in ScyllaDB is like a schema or database. We’ll create one for our movie data.
```
-- Create a keyspace named 'movie_recommendations'
-- with a replication factor of 1 (for a single-node setup).
CREATE KEYSPACE movie_recommendations WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
```
Explanation:
- CREATE KEYSPACE: The command to create a new keyspace.
- movie_recommendations: The name of our keyspace.
- WITH replication: Specifies the replication strategy. SimpleStrategy with replication_factor: '1' keeps a single copy of the data, which is adequate for a local, single-node development setup. Production clusters normally use NetworkTopologyStrategy with a higher replication factor.

Now, let’s switch to our new keyspace:
```
USE movie_recommendations;
```
Create a Table with a Vector Column

Next, we’ll create a table to store our movie data. This table will include a movie_vector column of type vector<float, 3>. We’re using a small dimension (3) for simplicity in this example, but in a real-world scenario, you’d likely use a dimension like 768 or 1536 from an embedding model.
```
-- Create a table to store movie information, including its vector embedding.
CREATE TABLE movies (
    movie_id UUID PRIMARY KEY,
    title TEXT,
    genre TEXT,
    movie_vector VECTOR<FLOAT, 3>
);
```
Explanation:
- CREATE TABLE movies: Creates a table named movies.
- movie_id UUID PRIMARY KEY: A unique identifier for each movie, serving as the primary key. UUID stands for universally unique identifier.
- title TEXT, genre TEXT: Standard text columns for movie metadata.
- movie_vector VECTOR<FLOAT, 3>: Our vector column! It will store 3-dimensional float vectors.

Insert Data with Vectors

Now, let’s add some movie data along with their “dummy” vector embeddings. In a real application, these vectors would come from an embedding model (e.g., OpenAI’s text-embedding-3-small).

```
-- Insert movie data with example 3-dimensional vectors.
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'The Matrix', 'Sci-Fi', [0.1, 0.2, 0.9]);
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Blade Runner 2049', 'Sci-Fi', [0.15, 0.25, 0.85]);
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Dune', 'Sci-Fi', [0.05, 0.1, 0.95]);
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Interstellar', 'Sci-Fi', [0.2, 0.3, 0.8]);
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Forrest Gump', 'Drama', [0.8, 0.1, 0.2]);
INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Titanic', 'Romance', [0.75, 0.05, 0.1]);
```

Explanation:
- INSERT INTO movies ... VALUES (...): Standard CQL insert statement.
- uuid(): Generates a new random UUID for movie_id.
- [0.1, 0.2, 0.9]: This is how you represent a vector literal in CQL: a bracketed, comma-separated list of floats whose length must equal the dimension declared for the column (3 here).
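In practice the vector comes from application code rather than being typed by hand. As a small illustration, this Python helper turns a sequence of numbers into the bracketed literal form shown above. The function name is hypothetical and it does no validation. For real applications, prefer your driver’s prepared statements with bound parameters over building CQL strings, and check whether your driver version supports the vector type.

```python
def cql_vector_literal(values):
    """Format a sequence of numbers as a CQL vector literal such as [0.1, 0.2, 0.9]."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


embedding = [0.1, 0.2, 0.9]
statement = (
    "INSERT INTO movies (movie_id, title, genre, movie_vector) "
    f"VALUES (uuid(), 'The Matrix', 'Sci-Fi', {cql_vector_literal(embedding)});"
)
print(cql_vector_literal(embedding))  # [0.1, 0.2, 0.9]
print(statement)
```

Pasting the printed statement into cqlsh is handy for quick experiments like the ones in this chapter.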

You can verify the data was inserted:

```
SELECT * FROM movies;
```

Create a Vector Index

This is a crucial step! We’ll create a custom index on our movie_vector column to enable efficient similarity searches. We’ll use similarity_function = COSINE as it’s very common for embeddings.
```
-- Create a vector index on the 'movie_vector' column.
-- 'vector_index' is the index class named in ScyllaDB's vector search documentation;
-- confirm the class name and the supported options for your release.
CREATE CUSTOM INDEX movie_vector_idx ON movies (movie_vector)
USING 'vector_index'
WITH OPTIONS = {
    'similarity_function': 'COSINE'
};
```
Explanation:
- CREATE CUSTOM INDEX movie_vector_idx: Initiates the creation of a custom index named movie_vector_idx.
- ON movies (movie_vector): Specifies that the index is on the movie_vector column of the movies table.
- USING 'vector_index': Selects ScyllaDB’s vector index implementation. ScyllaDB does not implement Cassandra’s SASI indexes, so the org.apache.cassandra.index.sasi.SASIIndex class name will not work here.
- WITH OPTIONS = {...}: Here’s where we configure the vector index:
  - 'similarity_function': 'COSINE': Configures the index to use cosine similarity for vector comparisons.
  - There is no need to set a mode or an index type: a vector index is always an Approximate Nearest Neighbor index, built as an HNSW (Hierarchical Navigable Small World) graph by USearch.

Index creation might take a moment, especially with larger datasets. ScyllaDB will build the USearch index in the background.
Perform a Similarity Search

Now for the fun part! Let’s find movies similar to a given query vector. Imagine we have a new movie idea, and we want to find existing movies that are conceptually similar.

Let’s use a query vector [0.12, 0.22, 0.88], which is somewhat similar to our Sci-Fi movies.
```
-- Search for movies similar to our query vector, limiting to the top 2 results.
SELECT title, genre, movie_vector
FROM movies
ORDER BY movie_vector ANN OF [0.12, 0.22, 0.88]
LIMIT 2;
```
Explanation:
- SELECT title, genre, movie_vector: We want to retrieve the title, genre, and the vector itself for the results.
- FROM movies: Querying our movies table.
- ORDER BY movie_vector ANN OF [0.12, 0.22, 0.88]: This is the core of the vector search. It asks ScyllaDB to order rows by how close movie_vector is to [0.12, 0.22, 0.88], using the approximate index. Note that ANN OF belongs in ORDER BY, not in WHERE.
- LIMIT 2: Restricts the results to the top 2 most similar movies. Always use LIMIT with ANN OF queries; it sets the k in “k nearest neighbors”, and a query without it is likely to be rejected.

You should see ‘The Matrix’ and ‘Blade Runner 2049’, as their vectors point in almost the same direction as our query vector.

What if we queried with a vector similar to our Drama/Romance movies, say [0.7, 0.08, 0.15]?
```
-- Find movies similar to a drama/romance-like vector.
SELECT title, genre, movie_vector
FROM movies
ORDER BY movie_vector ANN OF [0.7, 0.08, 0.15]
LIMIT 2;
```
You would likely get ‘Forrest Gump’ and ‘Titanic’ as results. This demonstrates how ScyllaDB, powered by USearch, can effectively find semantically similar items based on their vector embeddings!
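If you want to check the expected results independently of the database, the following Python sketch computes exact cosine similarities for the six sample vectors and ranks them for both query vectors. It is a sanity check on the arithmetic, not a model of the index. An approximate index could in principle return slightly different rows on large datasets, though on six rows the exact answer is what you should expect.

```python
import math

MOVIES = {
    "The Matrix": [0.1, 0.2, 0.9],
    "Blade Runner 2049": [0.15, 0.25, 0.85],
    "Dune": [0.05, 0.1, 0.95],
    "Interstellar": [0.2, 0.3, 0.8],
    "Forrest Gump": [0.8, 0.1, 0.2],
    "Titanic": [0.75, 0.05, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query, k):
    """Titles of the k sample movies with the highest cosine similarity to query."""
    ranked = sorted(MOVIES, key=lambda t: cosine_similarity(query, MOVIES[t]), reverse=True)
    return ranked[:k]


print(top_k([0.12, 0.22, 0.88], 2))  # ['The Matrix', 'Blade Runner 2049']
print(top_k([0.7, 0.08, 0.15], 2))   # ['Forrest Gump', 'Titanic']
```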

Mini-Challenge: Explore Different Similarity Functions

You’ve successfully performed your first vector search in ScyllaDB! Now, let’s try a small modification to deepen your understanding.

Challenge:

Drop the existing movie_vector_idx index.
Create a new vector index on the movie_vector column, but this time use similarity_function = EUCLIDEAN (L2, or Euclidean, distance) instead of COSINE.
Re-run the similarity search with the query vector [0.12, 0.22, 0.88] and LIMIT 2.
Observe if the results change. Why might they be different (or similar)?

Hint:

To drop an index, use DROP INDEX movie_vector_idx;
Remember that COSINE measures the angle between vectors (direction), while L2 measures the straight-line distance between their endpoints (magnitude and direction). This can lead to different “nearest” neighbors, especially if your vectors are not normalized.

What to observe/learn: Pay attention to how the choice of similarity function can influence the ranking of results. For unit-normalized vectors, cosine similarity and Euclidean distance produce the same ranking (up to the approximation the index introduces), because the squared distance between two unit vectors equals 2 minus twice their cosine. For unnormalized vectors, the two can diverge significantly. This helps you understand the importance of choosing the right metric for your specific embedding model and use case.
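The difference between direction and distance is easy to demonstrate outside the database. This Python sketch (Python 3.8+ for math.dist and multi-argument math.hypot) uses two made-up candidate vectors: cosine similarity prefers the one pointing the same way as the query, Euclidean distance prefers the one that is physically closer, and the disagreement disappears once everything is normalized to unit length.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


query = [1.0, 0.0]
candidates = {
    "same direction, far away": [10.0, 0.0],
    "nearby, different direction": [1.0, 0.5],
}

best_by_cosine = max(candidates, key=lambda n: cosine(query, candidates[n]))
best_by_l2 = min(candidates, key=lambda n: math.dist(query, candidates[n]))
print(best_by_cosine)  # same direction, far away
print(best_by_l2)      # nearby, different direction

# After normalization the two metrics agree: dist^2 == 2 - 2 * cosine.
unit_query = normalize(query)
unit_candidates = {n: normalize(v) for n, v in candidates.items()}
best_by_l2_unit = min(unit_candidates, key=lambda n: math.dist(unit_query, unit_candidates[n]))
print(best_by_l2_unit)  # same direction, far away
```

If your embedding model already emits unit-length vectors, the choice between cosine and Euclidean changes the scores but not the order of the results.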

Common Pitfalls & Troubleshooting

Working with vector search, especially when integrated into a database, can sometimes present challenges. Here are a few common pitfalls and how to troubleshoot them:

Incorrect Vector Dimension (N Mismatch):
- Pitfall: Defining a VECTOR<FLOAT, N> column with a dimension N that doesn’t match the actual dimension of the vectors you’re trying to insert or query.
- Troubleshooting: ScyllaDB will reject the statement with an error about the vector’s size. Always ensure the N in your table schema, your INSERT statements, and your ANN OF queries matches the output dimension of your embedding model.

Missing or Incorrect Vector Index:
- Pitfall: Attempting an ANN OF query on a vector column that either doesn’t have a vector index, or whose index was created incorrectly (for example, with the wrong index class in USING).
- Troubleshooting: ScyllaDB will usually return an error indicating that an ANN index is required. Double-check your CREATE CUSTOM INDEX statement for typos, especially in the USING class name and the similarity_function option. Verify the index exists using DESCRIBE TABLE movies; (or your table name) and looking for index details.

Performance Issues with Large Result Sets (LIMIT):
- Pitfall: Performing an ANN OF query with an excessively large LIMIT on a massive dataset, or omitting LIMIT altogether (which may be rejected outright).
- Troubleshooting: Retrieving a very large number of approximate nearest neighbors requires more graph traversal and more data transfer. For real-time applications, use a modest LIMIT (e.g., 10, 50, or 100) to fetch only the most relevant results. If you need more results, re-evaluate what the application actually does with them before raising the limit.

No Results Found:
- Pitfall: Your similarity search returns no rows, even when you expect some.
- Troubleshooting: A nearest-neighbor query returns the closest vectors however far away they are, so an empty result usually points to missing or not-yet-indexed data rather than to a “distant” query vector.
  - Check Data: Ensure rows were actually inserted with a non-null vector, and give a newly created index time to finish building.
  - Query Vector: Double-check your query vector. Was it produced by the same embedding model, with the same dimension, as the stored vectors?
  - Similarity Function: If you get rows but they look wrong, revisit the similarity_function (COSINE, EUCLIDEAN, DOT_PRODUCT), as explored in the mini-challenge. Cosine similarity ignores vector length and compares direction only, while Euclidean distance and dot product are sensitive to magnitude.
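Several of these pitfalls can be caught in application code before a statement ever reaches the database. The following Python sketch is one possible pre-flight check for a vector about to be inserted or used in a query. The function name and the specific checks are illustrative choices, not part of any ScyllaDB client library.

```python
import math


def check_vector(vector, dimension):
    """Return a list of problems found; an empty list means the vector looks usable."""
    problems = []
    if len(vector) != dimension:
        problems.append(f"dimension mismatch: expected {dimension}, got {len(vector)}")
    if any(not math.isfinite(x) for x in vector):
        problems.append("contains NaN or infinite components")
    elif all(x == 0 for x in vector):
        # Cosine similarity divides by the vector's length, which is zero here.
        problems.append("zero vector: cosine similarity is undefined")
    return problems


print(check_vector([0.12, 0.22, 0.88], 3))  # []
print(check_vector([0.12, 0.22], 3))        # ['dimension mismatch: expected 3, got 2']
print(check_vector([0.0, 0.0, 0.0], 3))     # ['zero vector: cosine similarity is undefined']
```

Running a check like this right after the embedding call gives a clearer error than a rejected CQL statement and keeps malformed vectors out of the table.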

Summary

Congratulations! You’ve successfully integrated and utilized ScyllaDB’s native vector search capabilities. This is a powerful step towards building modern, AI-driven applications that require real-time similarity search at scale.

Here are the key takeaways from this chapter:

ScyllaDB’s Native Vector Support: Recent ScyllaDB releases natively support storing and querying high-dimensional vectors using the VECTOR<FLOAT, N> data type; check the documentation for the minimum version.
USearch Under the Hood: ScyllaDB leverages the efficient open-source USearch library to power its Approximate Nearest Neighbor (ANN) indexing.
Creating Vector Indexes: You create vector indexes using CREATE CUSTOM INDEX ... USING 'vector_index' WITH OPTIONS = {'similarity_function': 'COSINE'}.
Performing Similarity Searches: The ANN OF operator in your ORDER BY clause allows you to query for similar vectors directly in CQL: SELECT ... ORDER BY vector_column ANN OF [query_vector] LIMIT k;.
Importance of LIMIT: Always use LIMIT with ANN OF queries; it defines how many neighbors are returned and keeps queries fast.
Choosing Similarity Function: The similarity_function (e.g., COSINE, EUCLIDEAN, DOT_PRODUCT) determines how similarity is calculated and should match how your embedding model was trained and whether its vectors are normalized.

You now have a robust foundation for building applications that can perform semantic search, recommendation systems, anomaly detection, and more, all within a highly scalable and performant database.

What’s Next?

In the next chapter, we’ll explore how to interact with ScyllaDB’s vector search from client applications using popular programming languages like Python. We’ll also delve into more advanced indexing strategies and performance tuning considerations for production deployments. Stay curious!

References

ScyllaDB Documentation: Vector Search Overview. https://docs.scylladb.com/manual/master/features/vector-search.html
ScyllaDB Press Release: ScyllaDB Brings Massive-Scale Vector Search to Real-Time AI. https://www.scylladb.com/press-release/scylladb-brings-massive-scale-vector-search-to-real-time-ai/
ScyllaDB GitHub: Vector Search Examples. https://github.com/scylladb/vector-search-examples
USearch GitHub Repository. https://github.com/unum-cloud/USearch
