Welcome to Chapter 13! In this exciting chapter, we’re going to put everything we’ve learned about USearch and ScyllaDB into action by building a practical, real-world application: a movie recommendation system. This project will solidify your understanding of how vector search powers intelligent applications, enabling personalized experiences for users.
By the end of this chapter, you’ll have a working recommendation engine that suggests movies based on semantic similarity. We’ll cover everything from preparing movie data and generating embeddings to storing them efficiently in ScyllaDB and performing lightning-fast similarity searches with the help of USearch’s underlying technology. Get ready to dive into the practical magic of AI-driven recommendations!
This chapter assumes you’re familiar with the core concepts of vector embeddings, USearch indexing, and basic ScyllaDB operations as covered in previous chapters. If any of these sound new, a quick refresh of earlier material might be helpful.
Core Concepts: The Recommendation Engine Blueprint
Before we start coding, let’s map out the core ideas behind our movie recommendation system. Understanding these foundational concepts will make the implementation much clearer.
2.1. Embeddings: The Language of Similarity
Ever wondered how a computer “understands” that “a thrilling space adventure” is similar to “an epic sci-fi saga” but different from “a romantic comedy”? The answer lies in embeddings.
What are Embeddings? Embeddings are numerical representations of text, images, audio, or any complex data type. Think of them as high-dimensional vectors (lists of numbers) where the “meaning” or “context” of the original data is encoded.
Why are they Important for Recommendations? The magic of embeddings is that semantically similar items (e.g., movies with similar plots or genres) will have embedding vectors that are “close” to each other in the high-dimensional space. Conversely, dissimilar items will be far apart. This property is exactly what we need for recommendations: find movies whose embeddings are closest to a user’s watched movie or expressed preference.
For our movie recommendation system, we’ll convert movie descriptions (or plots) into these numerical embeddings. We won’t be training our own complex embedding model from scratch; instead, we’ll leverage a pre-trained model, which is a common and efficient approach in many applications.
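To make "close" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made stand-ins, not output from a real embedding model; real embeddings in this chapter have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins for real embeddings.
space_adventure = [0.9, 0.8, 0.1]
scifi_saga = [0.8, 0.9, 0.2]
romantic_comedy = [0.1, 0.2, 0.9]

sim_scifi = cosine_similarity(space_adventure, scifi_saga)
sim_romcom = cosine_similarity(space_adventure, romantic_comedy)
print(f"space adventure vs. sci-fi saga: {sim_scifi:.3f}")
print(f"space adventure vs. romantic comedy: {sim_romcom:.3f}")
```

The first score comes out much higher than the second, which is the property a recommender relies on.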
2.2. Vector Search: Finding the Needle in the Haystack
Once we have our movie embeddings, the next challenge is efficiently finding the nearest neighbors to a given movie’s embedding. This is where vector search comes into play.
How USearch Makes it Fast Traditional database searches are great for exact matches or range queries. However, finding the “most similar” vector among millions or billions requires specialized algorithms. USearch excels at Approximate Nearest Neighbor (ANN) search. It builds highly optimized indexes that allow it to quickly locate vectors that are “close enough” to a query vector, even if not perfectly the closest, offering a fantastic balance between speed and accuracy. This speed is crucial for real-time recommendation systems where users expect instant suggestions.
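For contrast, the sketch below shows the exact, brute-force search that an ANN index approximates. It is not how USearch works internally; it simply compares the query against every stored vector, which is why it stops being practical at millions of items. The movie vectors are hypothetical toy values.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def exact_top_k(query, vectors, k):
    """Return the k (key, distance) pairs closest to query, nearest first."""
    scored = ((key, cosine_distance(query, vec)) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Hypothetical toy vectors standing in for stored movie embeddings.
toy_index = {
    "Interstellar": [0.9, 0.8, 0.1],
    "The Matrix": [0.7, 0.9, 0.3],
    "La La Land": [0.1, 0.2, 0.9],
    "Toy Story": [0.3, 0.1, 0.7],
}

neighbors = exact_top_k([0.85, 0.85, 0.15], toy_index, k=2)
for title, distance in neighbors:
    print(f"{title}: distance {distance:.4f}")
```

Every query here costs one distance computation per stored vector. An ANN index trades a small amount of accuracy to avoid that full scan.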
2.3. ScyllaDB’s Role: Scalable Vector Storage
Storing millions of movie embeddings and serving real-time similarity queries demands a robust and scalable database. This is where ScyllaDB shines, especially with its integrated Vector Search capabilities.
Why ScyllaDB for Vectors? ScyllaDB is a high-performance, open-source NoSQL database known for its low-latency and high-throughput capabilities. Its recent integration of vector search, leveraging libraries like USearch under the hood, makes it an ideal choice for our project:
- Scalability: Easily handles massive datasets and high query loads.
- Real-time Performance: Delivers recommendations with minimal latency.
- Integrated Vector Search: ScyllaDB now supports a VECTOR data type and the ANN OF query syntax, simplifying the process of storing vectors and performing similarity searches directly within the database. This means you don’t need a separate vector database in many cases.
This integrated approach simplifies our architecture and management, making ScyllaDB a powerful backbone for our recommendation engine.
2.4. System Architecture: A Visual Overview
Let’s visualize how these components fit together in our movie recommendation system.
Explanation of the Flow:
- A. & B. Data Ingestion: We’ll start by taking raw movie data, generating embeddings for their descriptions using a Python script, and then storing both the movie details and their embeddings into ScyllaDB.
- 1. User Interaction: A user interacts with our (simple) web application, perhaps by selecting a movie they like.
- 2. & 3. Similarity Search: The application fetches the embedding of the selected movie. Then, it queries ScyllaDB’s Vector Search feature to find other movies with the most similar embeddings. ScyllaDB, powered by USearch, efficiently finds these nearest neighbors.
- 4. & 5. Recommendations: ScyllaDB returns the top K most similar movies, which our application then displays to the user as recommendations.
This architecture is robust, scalable, and leverages the strengths of both USearch and ScyllaDB for an efficient recommendation system.
Step-by-Step Implementation: Building Our Movie Recommender
Now, let’s get our hands dirty and build the recommendation system piece by piece. We’ll use Python for our application logic.
3.1. Setting Up Our Environment
First things first, let’s create a clean Python environment and install the necessary libraries.
Create a Virtual Environment: Open your terminal or command prompt and run:
python3 -m venv rec_env
This creates a new virtual environment named rec_env.
Activate the Virtual Environment:
- On macOS/Linux:
source rec_env/bin/activate
- On Windows:
.\rec_env\Scripts\activate
You should see (rec_env) at the beginning of your prompt, indicating the environment is active.
Install Dependencies: We’ll need:
- scylla-driver: To connect and interact with ScyllaDB.
- usearch: The Python bindings for the USearch vector search library. As of 2026-02-17, we’ll assume a stable release like 3.1.0 for the Python bindings, which internally uses the core USearch library (e.g., version 6.25.0). Always check PyPI for the absolute latest version: pip install usearch. The scripts in this chapter never import usearch directly, because ScyllaDB runs the vector search on the server side; the package is useful for local experiments.
- sentence-transformers: A powerful library for easily generating sentence embeddings.
Run the following command:
pip install scylla-driver usearch==3.1.0 sentence-transformers==2.7.0
Note: The version numbers usearch==3.1.0 and sentence-transformers==2.7.0 are speculative for 2026-02-17. Please check PyPI for the latest stable versions when you run this, e.g., pip install usearch and pip install sentence-transformers.
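Because the pinned versions above are speculative, it helps to confirm what pip actually installed. The following is a small sketch using only the standard library; the helper name installed_version is made up for this example.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

for name in ("scylla-driver", "usearch", "sentence-transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Run it inside the activated rec_env environment so that it reports on the packages this project will use.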
3.2. ScyllaDB Schema for Movies
Before we can store data, we need a place for it in ScyllaDB. We’ll define a keyspace and a movies table. This table will include columns for movie details and, critically, a VECTOR column for our embeddings, along with an ANN index for efficient vector search.
Let’s use a Python script to connect to ScyllaDB and create our schema.
Create a new Python file named schema_setup.py:
# schema_setup.py
from cassandra.cluster import Cluster, ResultSet
from cassandra.auth import PlainTextAuthProvider
import os
# ScyllaDB Connection Details
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))
KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_DIMENSION = 384 # This will match our sentence-transformer model's output dimension
def setup_scylladb_schema():
    """
    Connects to ScyllaDB and sets up the keyspace and movie table with vector search.
    """
    auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
    cluster = Cluster(
        contact_points=SCYLLADB_CONTACT_POINTS,
        port=SCYLLADB_PORT,
        auth_provider=auth_provider
    )
    session = None
    try:
        session = cluster.connect()
        print(f"Connected to ScyllaDB at {SCYLLADB_CONTACT_POINTS}")
        # 1. Create Keyspace
        create_keyspace_query = f"""
        CREATE KEYSPACE IF NOT EXISTS {KEYSPACE_NAME}
        WITH replication = {{'class': 'NetworkTopologyStrategy', 'datacenter1': 1}};
        """
        session.execute(create_keyspace_query)
        session.set_keyspace(KEYSPACE_NAME)
        print(f"Keyspace '{KEYSPACE_NAME}' ensured.")
        # 2. Create Movies Table with VECTOR type
        create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
            movie_id UUID PRIMARY KEY,
            title TEXT,
            description TEXT,
            genre TEXT,
            release_year INT,
            embedding VECTOR<FLOAT, {EMBEDDING_DIMENSION}>
        );
        """
        session.execute(create_table_query)
        print(f"Table '{TABLE_NAME}' ensured.")
        # 3. Create Vector Index (ANN Index)
        # This index enables efficient Approximate Nearest Neighbor search on the 'embedding' column.
        create_index_query = f"""
        CREATE CUSTOM INDEX IF NOT EXISTS movies_embedding_idx
        ON {TABLE_NAME} (embedding)
        USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
        WITH OPTIONS = {{'similarity_function': 'COSINE', 'dim': '{EMBEDDING_DIMENSION}'}};
        """
        session.execute(create_index_query)
        print(f"Vector index 'movies_embedding_idx' ensured for {TABLE_NAME}.embedding.")
        print("ScyllaDB schema setup complete!")
    except Exception as e:
        print(f"Error setting up ScyllaDB schema: {e}")
    finally:
        if session:
            session.shutdown()
        if cluster:
            cluster.shutdown()
if __name__ == "__main__":
    setup_scylladb_schema()
Explanation:
- SCYLLADB_CONTACT_POINTS, SCYLLADB_USERNAME, SCYLLADB_PASSWORD: Connection settings read from environment variables. For a local setup, 127.0.0.1 and cassandra are often the defaults.
- EMBEDDING_DIMENSION: We set this to 384 because a common sentence-transformers model (all-MiniLM-L6-v2) outputs 384-dimensional embeddings. It’s crucial this matches your chosen embedding model.
- Cluster and session: Standard scylla-driver setup to connect to the database.
- CREATE KEYSPACE IF NOT EXISTS: Creates our movie_recommendations keyspace if it doesn’t already exist. NetworkTopologyStrategy is recommended for production. The datacenter name in the replication map (datacenter1 here) must match the name your cluster actually reports.
- CREATE TABLE IF NOT EXISTS movies: Defines our movies table.
  - movie_id UUID PRIMARY KEY: A unique identifier for each movie.
  - title, description, genre, release_year: Standard movie metadata.
  - embedding VECTOR<FLOAT, {EMBEDDING_DIMENSION}>: This is the critical part! It defines a column to store our 384-dimensional floating-point vectors. This is ScyllaDB’s native vector data type.
- CREATE CUSTOM INDEX ... USING 'org.apache.cassandra.index.sai.StorageAttachedIndex': This creates a Storage-Attached Index (SAI) on our embedding column, enabling ScyllaDB’s integrated vector search. The class name shown is the one Apache Cassandra uses for SAI; ScyllaDB’s own vector search documentation shows a 'vector_index' class instead, so confirm the exact class name and supported options for your version before running this statement.
  - 'similarity_function': 'COSINE': Specifies that we want to use cosine similarity, which is a common and effective metric for comparing embeddings. Other options like EUCLIDEAN or DOT_PRODUCT might be available depending on your ScyllaDB version.
  - 'dim': '{EMBEDDING_DIMENSION}': Explicitly tells the index the dimension of the vectors it will be handling.
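Since the column is declared with a fixed dimension, it can be worth checking each vector on the client before it reaches the database. The helper below is a hypothetical sketch (validate_embedding is not part of any driver) that rejects vectors of the wrong length or with non-finite values.

```python
import math

EMBEDDING_DIMENSION = 384  # must match the VECTOR<FLOAT, n> column in the schema

def validate_embedding(values, expected_dim=EMBEDDING_DIMENSION):
    """Return values as a list of floats, or raise ValueError if unusable."""
    vector = [float(v) for v in values]
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return vector

# A vector of the right length passes through unchanged.
ok = validate_embedding([0.0] * EMBEDDING_DIMENSION)
print(len(ok))
```

Calling this just before each INSERT turns a confusing server-side error into an immediate, readable one.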
To run this, make sure your ScyllaDB instance is running. If you’re running ScyllaDB locally via Docker, it might look like this:
docker run --name scylladb -d -p 9042:9042 scylladb/scylla:5.3.0
Note: ScyllaDB 5.3.0 (or later) is required for the GA Vector Search features mentioned as of January 2026. Always refer to the official ScyllaDB documentation for the latest stable version and setup instructions.
Once ScyllaDB is up, run your Python script:
python schema_setup.py
You should see output indicating successful connection and schema creation.
3.3. Generating Movie Embeddings
Now that our database schema is ready, let’s generate some mock movie data and their embeddings. For simplicity, we’ll use a small, hardcoded list of movies. In a real application, this data would come from a larger dataset.
Create a new Python file named data_ingestion.py:
# data_ingestion.py
from cassandra.cluster import Cluster, ResultSet
from cassandra.auth import PlainTextAuthProvider
from sentence_transformers import SentenceTransformer
import uuid
import os
import numpy as np
# ScyllaDB Connection Details (same as schema_setup.py)
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))
KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2' # A good general-purpose model
# Mock movie data
# In a real system, this would come from a CSV, API, etc.
mock_movies = [
{"title": "Inception", "description": "A thief who steals information by entering people's dreams is given the inverse task of planting an idea into a C.E.O.'s mind.", "genre": "Sci-Fi", "release_year": 2010},
{"title": "The Matrix", "description": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.", "genre": "Sci-Fi", "release_year": 1999},
{"title": "Interstellar", "description": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.", "genre": "Sci-Fi", "release_year": 2014},
{"title": "Pulp Fiction", "description": "The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.", "genre": "Crime", "release_year": 1994},
{"title": "Forrest Gump", "description": "The presidencies of Kennedy and Johnson, the Vietnam War, the Watergate scandal and other historical events unfold from the perspective of an Alabama man with an IQ of 75, whose only desire is to be reunited with his childhood sweetheart.", "genre": "Drama", "release_year": 1994},
{"title": "The Shawshank Redemption", "description": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.", "genre": "Drama", "release_year": 1994},
{"title": "Toy Story", "description": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.", "genre": "Animation", "release_year": 1995},
{"title": "Finding Nemo", "description": "After his son is captured in the Great Barrier Reef and taken to Sydney, a timid clownfish sets out on a journey to bring him home.", "genre": "Animation", "release_year": 2003},
{"title": "Spirited Away", "description": "During her family's move to the suburbs, a sullen 10-year-old girl wanders into a world ruled by gods, witches, and spirits, and where humans are changed into beasts.", "genre": "Animation", "release_year": 2001},
{"title": "La La Land", "description": "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future.", "genre": "Musical", "release_year": 2016},
{"title": "Eternal Sunshine of the Spotless Mind", "description": "When their relationship turns sour, a couple undergoes a medical procedure to have each other erased from their memories.", "genre": "Romance", "release_year": 2004},
{"title": "The Grand Budapest Hotel", "description": "The adventures of Gustave H, a legendary concierge at a famous hotel from the interwar period, and Zero Moustafa, the lobby boy who becomes his most trusted friend.", "genre": "Comedy", "release_year": 2014},
]
def generate_embeddings_and_ingest():
    """
    Generates embeddings for mock movie data and ingests them into ScyllaDB.
    """
    auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
    cluster = Cluster(
        contact_points=SCYLLADB_CONTACT_POINTS,
        port=SCYLLADB_PORT,
        auth_provider=auth_provider
    )
    session = None
    try:
        session = cluster.connect(KEYSPACE_NAME)
        print(f"Connected to ScyllaDB keyspace '{KEYSPACE_NAME}'")
        # Load the pre-trained Sentence Transformer model
        print(f"Loading Sentence Transformer model: {EMBEDDING_MODEL_NAME}...")
        model = SentenceTransformer(EMBEDDING_MODEL_NAME)
        print("Model loaded.")
        # Prepare the insert statement
        insert_movie_query = session.prepare(f"""
            INSERT INTO {TABLE_NAME} (movie_id, title, description, genre, release_year, embedding)
            VALUES (?, ?, ?, ?, ?, ?);
        """)
        print("Generating embeddings and inserting movies...")
        for movie_data in mock_movies:
            movie_id = uuid.uuid4()
            description = movie_data["description"]
            # Generate embedding for the movie description.
            # The model returns a NumPy array; it is converted to a list below for the driver.
            embedding = model.encode(description, convert_to_tensor=False)
            # Insert into ScyllaDB
            session.execute(insert_movie_query, (
                movie_id,
                movie_data["title"],
                description,
                movie_data["genre"],
                movie_data["release_year"],
                embedding.tolist()  # Convert numpy array to list for the driver
            ))
            print(f"Inserted movie: {movie_data['title']}")
        print(f"Successfully ingested {len(mock_movies)} movies into ScyllaDB.")
    except Exception as e:
        print(f"Error during data ingestion: {e}")
    finally:
        if session:
            session.shutdown()
        if cluster:
            cluster.shutdown()
if __name__ == "__main__":
    generate_embeddings_and_ingest()
Explanation:
- SentenceTransformer(EMBEDDING_MODEL_NAME): This line loads our pre-trained model. all-MiniLM-L6-v2 is a good choice for its balance of performance and size, producing 384-dimensional embeddings.
- model.encode(description, convert_to_tensor=False): This is where the magic happens! The model takes a movie description (text) and converts it into a numerical vector (a NumPy array of floats). convert_to_tensor=False ensures we get a NumPy array directly.
- insert_movie_query: We use a prepared statement for efficient insertion. Notice how embedding is passed as a parameter. The scylla-driver automatically handles converting a Python list of floats into ScyllaDB’s VECTOR type.
- embedding.tolist(): The scylla-driver expects a Python list for VECTOR types, so we convert the NumPy array.
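One side effect of uuid.uuid4() is that every run of data_ingestion.py generates fresh IDs, so running the script twice stores each movie twice. A possible alternative, sketched here under the assumption that title plus release year identifies a movie, is to derive the ID deterministically with uuid.uuid5. The namespace value and the helper name stable_movie_id are made up for this example.

```python
import uuid

# Hypothetical namespace for this tutorial; any fixed UUID works.
MOVIE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "movies.example.com")

def stable_movie_id(title, release_year):
    """Derive the same UUID every time for the same title and year."""
    normalized = f"{title.strip().lower()}|{release_year}"
    return uuid.uuid5(MOVIE_NAMESPACE, normalized)

first = stable_movie_id("Inception", 2010)
second = stable_movie_id("Inception", 2010)
print(first == second)
```

With a stable ID, a repeated INSERT overwrites the existing row instead of adding a duplicate, because CQL inserts are upserts on the primary key.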
Run the data ingestion script:
python data_ingestion.py
You should see messages indicating each movie being processed and inserted into ScyllaDB.
3.4. Building the Recommendation Logic
Now for the core of our system: the recommendation logic! We’ll create functions to:
- Retrieve a movie’s embedding by its title.
- Query ScyllaDB for similar movies using the ANN OF clause.
- Display the recommendations.
Create a new Python file named recommender.py:
# recommender.py
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from sentence_transformers import SentenceTransformer
import os
import uuid
import numpy as np
# ScyllaDB Connection Details (same as schema_setup.py)
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))
KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'
class MovieRecommender:
    def __init__(self):
        self.auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
        self.cluster = Cluster(
            contact_points=SCYLLADB_CONTACT_POINTS,
            port=SCYLLADB_PORT,
            auth_provider=self.auth_provider
        )
        self.session = self.cluster.connect(KEYSPACE_NAME)
        print(f"Recommender connected to ScyllaDB keyspace '{KEYSPACE_NAME}'")
        print(f"Loading Sentence Transformer model: {EMBEDDING_MODEL_NAME} for recommender...")
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME)
        print("Model loaded for recommender.")
        # The driver accepts '?' placeholders only in prepared statements,
        # so each query is prepared once here and reused.
        self.embedding_by_title_stmt = self.session.prepare(
            f"SELECT embedding FROM {TABLE_NAME} WHERE title = ? ALLOW FILTERING;"
        )
        self.details_by_id_stmt = self.session.prepare(
            f"SELECT title, description, genre, release_year FROM {TABLE_NAME} WHERE movie_id = ?;"
        )
        # The `ANN OF` clause is the core of our vector similarity query.
        self.ann_stmt = self.session.prepare(f"""
            SELECT movie_id, title, description, genre, release_year
            FROM {TABLE_NAME}
            ORDER BY embedding ANN OF ?
            LIMIT ?;
        """)
    def _get_movie_embedding_by_title(self, movie_title: str) -> np.ndarray | None:
        """
        Retrieves the embedding of a movie by its title from ScyllaDB.
        """
        row = self.session.execute(self.embedding_by_title_stmt, (movie_title,)).one()
        if row:
            # ScyllaDB returns VECTOR as a list, convert back to numpy array for consistency
            return np.array(row.embedding)
        return None
    def _get_movie_details_by_id(self, movie_id: uuid.UUID) -> dict | None:
        """
        Retrieves full details of a movie by its ID.
        """
        row = self.session.execute(self.details_by_id_stmt, (movie_id,)).one()
        if row:
            return {
                "movie_id": movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            }
        return None
    def get_recommendations(self, movie_title: str, limit: int = 5) -> list[dict]:
        """
        Generates movie recommendations based on a given movie title.
        """
        print(f"\nSearching for recommendations for: '{movie_title}'")
        # 1. Get the embedding for the reference movie
        reference_embedding = self._get_movie_embedding_by_title(movie_title)
        if reference_embedding is None:
            print(f"Movie '{movie_title}' not found in the database.")
            return []
        # 2. Perform ANN search using ScyllaDB's VECTOR search.
        # The query cannot exclude the reference movie itself, so we request
        # limit + 1 rows and drop it in Python below.
        # ScyllaDB expects the vector as a list for the ANN OF clause
        results = self.session.execute(self.ann_stmt, (reference_embedding.tolist(), limit + 1))
        recommendations = []
        for row in results:
            if row.title == movie_title:  # Exclude the movie we searched for
                continue
            recommendations.append({
                "movie_id": row.movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            })
            if len(recommendations) >= limit:
                break
        return recommendations
    def get_recommendations_from_text(self, query_text: str, limit: int = 5) -> list[dict]:
        """
        Generates movie recommendations based on a free-form text query.
        """
        print(f"\nSearching for recommendations based on text query: '{query_text}'")
        # 1. Generate embedding for the free-form text query
        query_embedding = self.model.encode(query_text, convert_to_tensor=False)
        # 2. Perform ANN search directly with the query embedding
        results = self.session.execute(self.ann_stmt, (query_embedding.tolist(), limit))
        recommendations = []
        for row in results:
            recommendations.append({
                "movie_id": row.movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            })
        return recommendations
    def shutdown(self):
        """Closes the ScyllaDB connection."""
        if self.session:
            self.session.shutdown()
        if self.cluster:
            self.cluster.shutdown()
        print("ScyllaDB connection closed.")
if __name__ == "__main__":
    recommender = MovieRecommender()
    # Example 1: Get recommendations based on an existing movie
    recommended_movies = recommender.get_recommendations("Inception", limit=3)
    print("\n--- Recommendations for 'Inception' ---")
    if recommended_movies:
        for movie in recommended_movies:
            print(f"- {movie['title']} ({movie['release_year']}) - Genre: {movie['genre']}")
            # print(f" Description: {movie['description'][:100]}...")
    else:
        print("No recommendations found.")
    # Example 2: Get recommendations based on a free-form text query
    text_query = "movies about space travel and futuristic technologies"
    recommended_movies_from_text = recommender.get_recommendations_from_text(text_query, limit=3)
    print(f"\n--- Recommendations for text query: '{text_query}' ---")
    if recommended_movies_from_text:
        for movie in recommended_movies_from_text:
            print(f"- {movie['title']} ({movie['release_year']}) - Genre: {movie['genre']}")
    else:
        print("No recommendations found.")
    recommender.shutdown()
Explanation:
- MovieRecommender class: Encapsulates our recommendation logic, including the ScyllaDB connection and the embedding model.
- _get_movie_embedding_by_title: A helper function to fetch the embedding of a specific movie from the database. We use ALLOW FILTERING here because title is not a primary key, but in a real system, you’d likely query by movie_id (primary key) for performance.
- get_recommendations(self, movie_title: str, limit: int = 5):
  - It first retrieves the embedding of the movie_title provided by the user.
  - ORDER BY embedding ANN OF ?: This is the heart of the vector search! ScyllaDB’s ANN OF clause takes a query vector (our reference_embedding) and returns results ordered by their similarity to it.
  - LIMIT ?: Limits the number of recommendations returned.
  - We add + 1 to the limit and then manually filter out the original movie to ensure we get limit unique recommendations.
- get_recommendations_from_text(self, query_text: str, limit: int = 5): This demonstrates a powerful feature: you can also generate recommendations directly from a natural language query! The input text is first converted into an embedding, which is then used in the ANN OF query. This allows users to describe what they’re looking for.
- if __name__ == "__main__": block: This section provides simple examples to test our recommender.
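The "fetch limit + 1, then drop the reference movie" step can be isolated and tested without a database. The sketch below uses plain dictionaries as stand-ins for driver rows; the function name is hypothetical.

```python
def drop_reference_and_truncate(rows, reference_title, limit):
    """Keep the first `limit` rows whose title differs from reference_title."""
    if limit <= 0:
        return []
    kept = []
    for row in rows:
        if row["title"] == reference_title:
            continue  # skip the movie the user started from
        kept.append(row)
        if len(kept) >= limit:
            break
    return kept

# Stand-in for the rows an ANN query might return with LIMIT 4 (limit + 1).
ann_rows = [
    {"title": "Inception"},
    {"title": "The Matrix"},
    {"title": "Interstellar"},
    {"title": "Eternal Sunshine of the Spotless Mind"},
]
picked = drop_reference_and_truncate(ann_rows, "Inception", limit=3)
print([row["title"] for row in picked])
```

The truncation also covers the case where an approximate search happens not to return the reference movie at all: the extra row is simply cut off.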
Run the recommender script:
python recommender.py
You should see output similar to this (results may vary slightly due to ANN approximations):
Recommender connected to ScyllaDB keyspace 'movie_recommendations'
Loading Sentence Transformer model: all-MiniLM-L6-v2 for recommender...
Model loaded for recommender.
Searching for recommendations for: 'Inception'
--- Recommendations for 'Inception' ---
- The Matrix (1999) - Genre: Sci-Fi
- Interstellar (2014) - Genre: Sci-Fi
- Eternal Sunshine of the Spotless Mind (2004) - Genre: Romance
Searching for recommendations based on text query: 'movies about space travel and futuristic technologies'
--- Recommendations for text query: 'movies about space travel and futuristic technologies' ---
- Interstellar (2014) - Genre: Sci-Fi
- The Matrix (1999) - Genre: Sci-Fi
- Inception (2010) - Genre: Sci-Fi
ScyllaDB connection closed.
Notice how “The Matrix” and “Interstellar” (both Sci-Fi) are recommended for “Inception,” and the text query correctly surfaces Sci-Fi movies. Pretty cool, right?
Mini-Challenge: Enhance the Search
Our recommender is working great, but what if a user wants to narrow the results by metadata as well as similarity? Imagine they want “Sci-Fi movies like Inception, but from the 90s.”
Challenge:
Modify the get_recommendations function (or create a new one) to allow filtering recommendations by genre and/or release_year in addition to the vector similarity search. This will combine the power of traditional structured queries with vector search.
Hint:
ScyllaDB’s ANN OF clause can be combined with WHERE clauses for additional filtering. For example:
SELECT ... FROM movies WHERE genre = 'Sci-Fi' ORDER BY embedding ANN OF ? LIMIT ?;
Remember that for WHERE clauses to be efficient on non-primary key columns, you might need a secondary index if you were dealing with a very large dataset and frequent filtering. However, for genre and release_year on our small dataset, it will work.
What to observe/learn:
- How to combine semantic search (ANN) with exact filtering.
- The syntax for combining WHERE and ORDER BY embedding ANN OF clauses.
- The flexibility of hybrid search approaches.
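As a starting point for the challenge, here is one possible way to assemble the query text with optional filters. It only builds the CQL string and parameter list. Whether your server accepts these predicates alongside ANN OF (in particular the range conditions on release_year) is an assumption to verify against your ScyllaDB version. The function name build_hybrid_query is made up for this sketch.

```python
def build_hybrid_query(table, genre=None, year_from=None, year_to=None):
    """Return (cql, filter_params) for an ANN query with optional filters.

    `table` must be a trusted constant; only the filter values are bound.
    """
    clauses, params = [], []
    if genre is not None:
        clauses.append("genre = ?")
        params.append(genre)
    if year_from is not None:
        clauses.append("release_year >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("release_year <= ?")
        params.append(year_to)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    cql = (
        f"SELECT movie_id, title, genre, release_year FROM {table}"
        f"{where} ORDER BY embedding ANN OF ? LIMIT ?;"
    )
    return cql, params

cql, filter_params = build_hybrid_query("movies", genre="Sci-Fi", year_from=1990, year_to=1999)
print(cql)
print(filter_params)
```

To run it, you would prepare the returned CQL with session.prepare and execute it with the filter parameters followed by the query vector and the limit, in that order.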
Common Pitfalls & Troubleshooting
Even with a structured approach, you might encounter some bumps along the road. Here are a few common pitfalls and how to troubleshoot them:
Embedding Model Mismatch (Dimensions):
- Pitfall: Your EMBEDDING_DIMENSION in schema_setup.py doesn’t match the actual output dimension of your SentenceTransformer model. For example, if you set EMBEDDING_DIMENSION = 768 but use all-MiniLM-L6-v2 (which outputs 384), you’ll get errors during data ingestion or index creation.
- Troubleshooting: Always verify the output dimension of your chosen SentenceTransformer model. You can do this by encoding a dummy sentence:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
dummy_embedding = model.encode("hello world")
print(len(dummy_embedding)) # Should print 384
Ensure this value matches EMBEDDING_DIMENSION in your schema.
ScyllaDB Connection Issues:
- Pitfall: cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': ConnectionRefusedError(111, "Tried to connect to ['127.0.0.1:9042']. Last error: Connection refused")})
- Troubleshooting:
  - Is your ScyllaDB instance running? Check your Docker containers (docker ps) or system services.
  - Is the port correct (9042 is default)?
  - Are the SCYLLADB_CONTACT_POINTS, SCYLLADB_USERNAME, SCYLLADB_PASSWORD correct? Double-check environment variables or hardcoded values.
  - If running in Docker, ensure the port is exposed (-p 9042:9042).
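A quick way to separate network problems from driver or authentication problems is to test the TCP port directly. This sketch uses only the standard library; can_connect is a made-up helper name, and a successful connection only shows that something is listening, not that it is ScyllaDB or that your credentials are valid.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("CQL port reachable:", can_connect("127.0.0.1", 9042))
```

If this prints False, fix the container or port mapping first. If it prints True and the driver still fails, look at the credentials and contact points instead.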
ANN OF Query Syntax Errors or Index Not Found:
- Pitfall: You might get Invalid query: No vector index found for table movie_recommendations.movies on column embedding with dimension 384 or a generic syntax error for the ANN OF clause.
- Troubleshooting:
  - Did you run schema_setup.py? Ensure the custom index movies_embedding_idx was successfully created.
  - ScyllaDB Version: Vector search (VECTOR type, ANN OF clause, SAI with similarity_function) is a relatively new feature. Make sure you are running ScyllaDB 5.3.0 or a later version. Older versions will not support these features.
  - Index Options: Double-check the similarity_function and dim options in your CREATE CUSTOM INDEX statement. They must match the expected values.
  - Case Sensitivity: ScyllaDB keyspace and table names are typically lowercase. Ensure consistency.
Summary
Congratulations! You’ve successfully built a functional movie recommendation system powered by USearch and ScyllaDB’s integrated vector search.
Here are the key takeaways from this chapter:
- Embeddings are fundamental: They transform complex data like movie descriptions into numerical vectors, enabling semantic comparisons.
- Vector Search is crucial for relevance: USearch, integrated into ScyllaDB, provides the high-performance Approximate Nearest Neighbor (ANN) search needed for real-time recommendations.
- ScyllaDB offers scalable vector storage: Its VECTOR data type and ANN OF query syntax simplify storing and querying embeddings at scale.
- Hybrid search is powerful: You can combine vector similarity search with traditional database filtering (WHERE clauses) for even more precise recommendations.
- Practical application: You’ve seen how these technologies come together to create intelligent, personalized user experiences.
What’s next? You could expand this project by:
- Integrating with a real movie dataset (e.g., MovieLens).
- Building a simple web UI for your recommender.
- Experimenting with different embedding models and similarity metrics.
- Exploring advanced ScyllaDB features for performance tuning.
Keep exploring, keep building, and remember the power of vector search in shaping the future of AI applications!