Chapter 13: Building a Movie Recommendation System

Welcome to Chapter 13! In this chapter, we put what we’ve learned about USearch and ScyllaDB into practice by building a real-world application: a movie recommendation system. The project shows how vector search powers applications that personalize results for each user.

By the end of this chapter, you’ll have a working recommendation engine that suggests movies based on semantic similarity. We’ll cover preparing movie data, generating embeddings, storing them in ScyllaDB, and running low-latency similarity searches through ScyllaDB’s vector search, which builds on USearch.

This chapter assumes you’re familiar with the core concepts of vector embeddings, USearch indexing, and basic ScyllaDB operations as covered in previous chapters. If any of these sound new, a quick refresh of earlier material might be helpful.

Core Concepts: The Recommendation Engine Blueprint

Before we start coding, let’s map out the core ideas behind our movie recommendation system. Understanding these foundational concepts will make the implementation much clearer.

2.1. Embeddings: The Language of Similarity

Ever wondered how a computer “understands” that “a thrilling space adventure” is similar to “an epic sci-fi saga” but different from “a romantic comedy”? The answer lies in embeddings.

What are Embeddings? Embeddings are numerical representations of text, images, audio, or any complex data type. Think of them as high-dimensional vectors (lists of numbers) where the “meaning” or “context” of the original data is encoded.

Why are they Important for Recommendations? The magic of embeddings is that semantically similar items (e.g., movies with similar plots or genres) will have embedding vectors that are “close” to each other in the high-dimensional space. Conversely, dissimilar items will be far apart. This property is exactly what we need for recommendations: find movies whose embeddings are closest to a user’s watched movie or expressed preference.
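
To make “close” concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made for illustration (real sentence embeddings have hundreds of dimensions and come from a model), and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT real model output.
space_adventure = [0.9, 0.1, 0.0]
scifi_saga = [0.8, 0.2, 0.1]
romantic_comedy = [0.1, 0.1, 0.9]

# The first pair points in nearly the same direction; the second pair does not.
print(cosine_similarity(space_adventure, scifi_saga))
print(cosine_similarity(space_adventure, romantic_comedy))
```

The first score is much higher than the second, which is the property a recommender relies on: items with similar meaning end up with vectors pointing in similar directions.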

For our movie recommendation system, we’ll convert movie descriptions (or plots) into these numerical embeddings. We won’t be training our own complex embedding model from scratch; instead, we’ll leverage a pre-trained model, which is a common and efficient approach in many applications.

2.2. Vector Search: Finding the Needle in the Haystack

Once we have our movie embeddings, the next challenge is efficiently finding the nearest neighbors to a given movie’s embedding. This is where vector search comes into play.

How USearch Makes It Fast: Traditional database searches are great for exact matches or range queries. However, finding the “most similar” vector among millions or billions requires specialized algorithms. USearch specializes in Approximate Nearest Neighbor (ANN) search. It builds optimized indexes that quickly locate vectors that are “close enough” to a query vector, even if they are not always the exact closest ones, trading a small amount of accuracy for a large gain in speed. This speed is crucial for real-time recommendation systems where users expect instant suggestions.
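
For contrast, the exact search that an ANN index approximates can be sketched as a brute-force scan: score every stored vector against the query and keep the best k. This is not a description of USearch’s internals; it is the baseline whose cost grows linearly with the size of the collection, which is why approximate indexes exist. The names and toy vectors below are hypothetical.

```python
import heapq
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, so smaller values mean more similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def exact_top_k(query, vectors, k):
    """Scan every vector and return the k (key, distance) pairs with the smallest distance."""
    scored = ((key, cosine_distance(query, vec)) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy collection: every query touches all entries, which is fine for 4 vectors
# but far too slow for millions.
toy_index = {
    "space": [0.9, 0.1, 0.0],
    "scifi": [0.8, 0.2, 0.1],
    "romcom": [0.1, 0.1, 0.9],
    "heist": [0.5, 0.5, 0.2],
}
print(exact_top_k([1.0, 0.0, 0.0], toy_index, k=2))
```

An ANN index answers the same question while visiting only a small fraction of the stored vectors, at the price of occasionally missing a true nearest neighbor.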

2.3. ScyllaDB’s Role: Scalable Vector Storage

Storing millions of movie embeddings and serving real-time similarity queries demands a robust and scalable database. This is where ScyllaDB shines, especially with its integrated Vector Search capabilities.

Why ScyllaDB for Vectors? ScyllaDB is a high-performance, open-source NoSQL database known for its low-latency and high-throughput capabilities. Its recent integration of vector search, leveraging libraries like USearch under the hood, makes it an ideal choice for our project:

Scalability: Easily handles massive datasets and high query loads.
Real-time Performance: Delivers recommendations with minimal latency.
Integrated Vector Search: ScyllaDB now supports a VECTOR data type and the ANN OF query syntax, simplifying the process of storing vectors and performing similarity searches directly within the database. This means you don’t need a separate vector database in many cases.

This integrated approach simplifies our architecture and management, making ScyllaDB a powerful backbone for our recommendation engine.

2.4. System Architecture: A Visual Overview

Let’s visualize how these components fit together in our movie recommendation system.

Explanation of the Flow:

A. & B. Data Ingestion: We’ll start by taking raw movie data, generating embeddings for their descriptions using a Python script, and then storing both the movie details and their embeddings into ScyllaDB.
1. User Interaction: A user interacts with our (simple) web application, perhaps by selecting a movie they like.
2. & 3. Similarity Search: The application fetches the embedding of the selected movie. Then, it queries ScyllaDB’s Vector Search feature to find other movies with the most similar embeddings. ScyllaDB, powered by USearch, efficiently finds these nearest neighbors.
4. & 5. Recommendations: ScyllaDB returns the top K most similar movies, which our application then displays to the user as recommendations.

This architecture is robust, scalable, and leverages the strengths of both USearch and ScyllaDB for an efficient recommendation system.

Step-by-Step Implementation: Building Our Movie Recommender

Now, let’s get our hands dirty and build the recommendation system piece by piece. We’ll use Python for our application logic.

3.1. Setting Up Our Environment

First things first, let’s create a clean Python environment and install the necessary libraries.

Create a Virtual Environment: Open your terminal or command prompt and run:
```
python3 -m venv rec_env
```
This creates a new virtual environment named rec_env.
Activate the Virtual Environment:
- On macOS/Linux:
```
source rec_env/bin/activate
```
- On Windows:
```
.\rec_env\Scripts\activate
```
You should see (rec_env) at the beginning of your prompt, indicating the environment is active.
Install Dependencies: We’ll need:
- scylla-driver: To connect and interact with ScyllaDB.
- usearch: The Python bindings for the USearch vector search library. As of 2026-02-17, we’ll assume a stable release such as 3.1.0 for the Python bindings, which internally uses the core USearch library (e.g., version 6.25.0); always check PyPI for the latest version. Note that the scripts in this chapter never import usearch directly, because ScyllaDB performs the vector search on the server side.
- sentence-transformers: A powerful library for easily generating sentence embeddings.
Run the following command:
```
pip install scylla-driver usearch==3.1.0 sentence-transformers==2.7.0
```
Note: The pinned versions usearch==3.1.0 and sentence-transformers==2.7.0 are speculative for 2026-02-17. If pip cannot find a pinned version, drop the pin (pip install usearch sentence-transformers) and use the latest stable releases from PyPI.
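
Because the pinned versions above are only assumptions, it can help to confirm what pip actually installed. The following is a small sketch using the standard library’s importlib.metadata; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names as passed to pip in this chapter.
PACKAGES = ["scylla-driver", "usearch", "sentence-transformers"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated rec_env environment; any line reporting NOT INSTALLED means the corresponding pip install step did not succeed.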

3.2. ScyllaDB Schema for Movies

Before we can store data, we need a place for it in ScyllaDB. We’ll define a keyspace and a movies table. This table will include columns for movie details and, critically, a VECTOR column for our embeddings, along with an ANN index for efficient vector search.

Let’s use a Python script to connect to ScyllaDB and create our schema.

Create a new Python file named schema_setup.py:

# schema_setup.py
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import os

# ScyllaDB Connection Details
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))

KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_DIMENSION = 384 # This will match our sentence-transformer model's output dimension

def setup_scylladb_schema():
    """
    Connects to ScyllaDB and sets up the keyspace and movie table with vector search.
    """
    auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
    cluster = Cluster(
        contact_points=SCYLLADB_CONTACT_POINTS,
        port=SCYLLADB_PORT,
        auth_provider=auth_provider
    )
    session = None
    try:
        session = cluster.connect()
        print(f"Connected to ScyllaDB at {SCYLLADB_CONTACT_POINTS}")

        # 1. Create Keyspace
        create_keyspace_query = f"""
        CREATE KEYSPACE IF NOT EXISTS {KEYSPACE_NAME}
        WITH replication = {{'class': 'NetworkTopologyStrategy', 'datacenter1': 1}};
        """
        session.execute(create_keyspace_query)
        session.set_keyspace(KEYSPACE_NAME)
        print(f"Keyspace '{KEYSPACE_NAME}' ensured.")

        # 2. Create Movies Table with VECTOR type
        create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
            movie_id UUID PRIMARY KEY,
            title TEXT,
            description TEXT,
            genre TEXT,
            release_year INT,
            embedding VECTOR<FLOAT, {EMBEDDING_DIMENSION}>
        );
        """
        session.execute(create_table_query)
        print(f"Table '{TABLE_NAME}' ensured.")

        # 3. Create Vector Index (ANN Index)
        # This index enables efficient Approximate Nearest Neighbor search on the 'embedding' column.
        create_index_query = f"""
        CREATE CUSTOM INDEX IF NOT EXISTS movies_embedding_idx
        ON {TABLE_NAME} (embedding)
        USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
        WITH OPTIONS = {{'similarity_function': 'COSINE', 'dim': '{EMBEDDING_DIMENSION}'}};
        """
        session.execute(create_index_query)
        print(f"Vector index 'movies_embedding_idx' ensured for {TABLE_NAME}.embedding.")
        print("ScyllaDB schema setup complete!")

    except Exception as e:
        print(f"Error setting up ScyllaDB schema: {e}")
    finally:
        if session:
            session.shutdown()
        if cluster:
            cluster.shutdown()

if __name__ == "__main__":
    setup_scylladb_schema()

Explanation:

SCYLLADB_CONTACT_POINTS, SCYLLADB_USERNAME, SCYLLADB_PASSWORD: These are environment variables for connecting to your ScyllaDB instance. For a local setup, 127.0.0.1 and cassandra are often defaults.
EMBEDDING_DIMENSION: We set this to 384 because a common sentence-transformers model (all-MiniLM-L6-v2) outputs 384-dimensional embeddings. It’s crucial this matches your chosen embedding model.
Cluster and session: Standard scylla-driver setup to connect to the database.
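CREATE KEYSPACE IF NOT EXISTS: Creates our movie_recommendations keyspace if it doesn’t already exist. NetworkTopologyStrategy is recommended for production. The datacenter name in the replication map (datacenter1 here) must match the datacenter name your cluster reports.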
CREATE TABLE IF NOT EXISTS movies: Defines our movies table.
- movie_id UUID PRIMARY KEY: A unique identifier for each movie.
- title, description, genre, release_year: Standard movie metadata.
- embedding VECTOR<FLOAT, {EMBEDDING_DIMENSION}>: This is the critical part! It defines a column to store our 384-dimensional floating-point vectors. This is ScyllaDB’s native vector data type.
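CREATE CUSTOM INDEX ... USING 'org.apache.cassandra.index.sai.StorageAttachedIndex': This creates a Storage-Attached Index (SAI) on our embedding column, enabling ScyllaDB’s integrated vector search. This class name is the one Apache Cassandra uses for SAI; ScyllaDB’s own Vector Search documentation creates the index with USING 'vector_index', so if the statement is rejected, check the index syntax documented for your release.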
- 'similarity_function': 'COSINE': Specifies that we want to use cosine similarity, which is a common and effective metric for comparing embeddings. Other options like EUCLIDEAN or DOT_PRODUCT might be available depending on your ScyllaDB version.
- 'dim': '{EMBEDDING_DIMENSION}': Explicitly tells the index the dimension of the vectors it will be handling.
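
Since the column type fixes the vector length, a cheap guard on the client side can catch a mismatched embedding before it reaches the database. This is a minimal sketch; check_embedding is a hypothetical helper, not part of the driver.

```python
import math

EMBEDDING_DIMENSION = 384  # must equal N in VECTOR<FLOAT, N>

def check_embedding(values, expected_dim=EMBEDDING_DIMENSION):
    """Return the embedding as a list of floats, or raise ValueError if it cannot be stored."""
    vector = [float(v) for v in values]
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return vector
```

Calling check_embedding(embedding) right before an INSERT turns a confusing server-side error into a clear message that names the expected and actual dimensions.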

To run this, make sure your ScyllaDB instance is running. If you’re running ScyllaDB locally via Docker, it might look like this:

docker run --name scylladb -d -p 9042:9042 scylladb/scylla:5.3.0

Note: ScyllaDB 5.3.0 (or later) is required for the GA Vector Search features mentioned as of January 2026. Always refer to the official ScyllaDB documentation for the latest stable version and setup instructions.
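
A freshly started container usually needs some time before it accepts CQL connections, so running the schema script immediately can fail with a connection error. One way to handle this, sketched here with only the standard library, is to poll the port first. The helper name wait_for_port and its default timings are assumptions, and an open port only shows that the listener is up, not that the node has finished every startup task.

```python
import socket
import time

def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused) while the port is closed.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)

# Example (not executed here): wait_for_port("127.0.0.1", 9042)
```

If the function returns False, check docker ps and the container logs before retrying.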

Once ScyllaDB is up, run your Python script:

python schema_setup.py

You should see output indicating successful connection and schema creation.

3.3. Generating Movie Embeddings

Now that our database schema is ready, let’s generate some mock movie data and their embeddings. For simplicity, we’ll use a small, hardcoded list of movies. In a real application, this data would come from a larger dataset.

Create a new Python file named data_ingestion.py:

# data_ingestion.py
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from sentence_transformers import SentenceTransformer
import uuid
import os
import numpy as np

# ScyllaDB Connection Details (same as schema_setup.py)
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))

KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2' # A good general-purpose model

# Mock movie data
# In a real system, this would come from a CSV, API, etc.
mock_movies = [
    {"title": "Inception", "description": "A thief who steals information by entering people's dreams is given the inverse task of planting an idea into a C.E.O.'s mind.", "genre": "Sci-Fi", "release_year": 2010},
    {"title": "The Matrix", "description": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.", "genre": "Sci-Fi", "release_year": 1999},
    {"title": "Interstellar", "description": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.", "genre": "Sci-Fi", "release_year": 2014},
    {"title": "Pulp Fiction", "description": "The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.", "genre": "Crime", "release_year": 1994},
    {"title": "Forrest Gump", "description": "The presidencies of Kennedy and Johnson, the Vietnam War, the Watergate scandal and other historical events unfold from the perspective of an Alabama man with an IQ of 75, whose only desire is to be reunited with his childhood sweetheart.", "genre": "Drama", "release_year": 1994},
    {"title": "The Shawshank Redemption", "description": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.", "genre": "Drama", "release_year": 1994},
    {"title": "Toy Story", "description": "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.", "genre": "Animation", "release_year": 1995},
    {"title": "Finding Nemo", "description": "After his son is captured in the Great Barrier Reef and taken to Sydney, a timid clownfish sets out on a journey to bring him home.", "genre": "Animation", "release_year": 2003},
    {"title": "Spirited Away", "description": "During her family's move to the suburbs, a sullen 10-year-old girl wanders into a world ruled by gods, witches, and spirits, and where humans are changed into beasts.", "genre": "Animation", "release_year": 2001},
    {"title": "La La Land", "description": "While navigating their careers in Los Angeles, a pianist and an actress fall in love while attempting to reconcile their aspirations for the future.", "genre": "Musical", "release_year": 2016},
    {"title": "Eternal Sunshine of the Spotless Mind", "description": "When their relationship turns sour, a couple undergoes a medical procedure to have each other erased from their memories.", "genre": "Romance", "release_year": 2004},
    {"title": "The Grand Budapest Hotel", "description": "The adventures of Gustave H, a legendary concierge at a famous hotel from the interwar period, and Zero Moustafa, the lobby boy who becomes his most trusted friend.", "genre": "Comedy", "release_year": 2014},
]

def generate_embeddings_and_ingest():
    """
    Generates embeddings for mock movie data and ingests them into ScyllaDB.
    """
    auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
    cluster = Cluster(
        contact_points=SCYLLADB_CONTACT_POINTS,
        port=SCYLLADB_PORT,
        auth_provider=auth_provider
    )
    session = None
    try:
        session = cluster.connect(KEYSPACE_NAME)
        print(f"Connected to ScyllaDB keyspace '{KEYSPACE_NAME}'")

        # Load the pre-trained Sentence Transformer model
        print(f"Loading Sentence Transformer model: {EMBEDDING_MODEL_NAME}...")
        model = SentenceTransformer(EMBEDDING_MODEL_NAME)
        print("Model loaded.")

        # Prepare the insert statement
        insert_movie_query = session.prepare(f"""
            INSERT INTO {TABLE_NAME} (movie_id, title, description, genre, release_year, embedding)
            VALUES (?, ?, ?, ?, ?, ?);
        """)

        print("Generating embeddings and inserting movies...")
        for movie_data in mock_movies:
            movie_id = uuid.uuid4()
            description = movie_data["description"]

            # Generate embedding for the movie description
            # With convert_to_tensor=False the model returns a NumPy array; it is converted to a list below for the VECTOR column
            embedding = model.encode(description, convert_to_tensor=False)

            # Insert into ScyllaDB
            session.execute(insert_movie_query, (
                movie_id,
                movie_data["title"],
                description,
                movie_data["genre"],
                movie_data["release_year"],
                embedding.tolist() # Convert numpy array to list for the driver
            ))
            print(f"Inserted movie: {movie_data['title']}")

        print(f"Successfully ingested {len(mock_movies)} movies into ScyllaDB.")

    except Exception as e:
        print(f"Error during data ingestion: {e}")
    finally:
        if session:
            session.shutdown()
        if cluster:
            cluster.shutdown()

if __name__ == "__main__":
    generate_embeddings_and_ingest()

Explanation:

SentenceTransformer(EMBEDDING_MODEL_NAME): This line loads our pre-trained model. all-MiniLM-L6-v2 is a good choice for its balance of performance and size, producing 384-dimensional embeddings.
model.encode(description, convert_to_tensor=False): This is where the magic happens! The model takes a movie description (text) and converts it into a numerical vector (a NumPy array of floats). convert_to_tensor=False ensures we get a NumPy array directly.
insert_movie_query: We use a prepared statement for efficient insertion. Notice how embedding is passed as a parameter. The scylla-driver automatically handles converting a Python list of floats into ScyllaDB’s VECTOR type.
embedding.tolist(): The scylla-driver expects a Python list for VECTOR types, so we convert the NumPy array.
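
One consequence of calling uuid.uuid4() in the loop is that re-running the script inserts every movie again under a new key. A possible alternative, offered here as an assumption rather than as part of the original script, is to derive the ID from the title and release year with uuid.uuid5, so repeated runs write to the same rows. The namespace string is a hypothetical placeholder.

```python
import uuid

# Hypothetical namespace for this tutorial; any fixed UUID works as long as it never changes.
MOVIE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "movies.example.com")

def stable_movie_id(title, release_year):
    """Derive the same UUID every time for the same (title, release_year) pair."""
    return uuid.uuid5(MOVIE_NAMESPACE, f"{title.strip().lower()}|{release_year}")

print(stable_movie_id("Inception", 2010))
```

To try it, replace movie_id = uuid.uuid4() with movie_id = stable_movie_id(movie_data["title"], movie_data["release_year"]). Because a CQL INSERT overwrites an existing row with the same primary key, a second run then updates the twelve movies instead of duplicating them.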

Run the data ingestion script:

python data_ingestion.py

You should see messages indicating each movie being processed and inserted into ScyllaDB.

3.4. Building the Recommendation Logic

Now for the core of our system: the recommendation logic! We’ll create functions to:

Retrieve a movie’s embedding by its title.
Query ScyllaDB for similar movies using the ANN OF clause.
Display the recommendations.

Create a new Python file named recommender.py:

# recommender.py
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from sentence_transformers import SentenceTransformer
import os
import uuid
import numpy as np

# ScyllaDB Connection Details (same as schema_setup.py)
SCYLLADB_CONTACT_POINTS = os.environ.get("SCYLLADB_CONTACT_POINTS", "127.0.0.1").split(',')
SCYLLADB_USERNAME = os.environ.get("SCYLLADB_USERNAME", "cassandra")
SCYLLADB_PASSWORD = os.environ.get("SCYLLADB_PASSWORD", "cassandra")
SCYLLADB_PORT = int(os.environ.get("SCYLLADB_PORT", 9042))

KEYSPACE_NAME = "movie_recommendations"
TABLE_NAME = "movies"
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'

class MovieRecommender:
    def __init__(self):
        self.auth_provider = PlainTextAuthProvider(username=SCYLLADB_USERNAME, password=SCYLLADB_PASSWORD)
        self.cluster = Cluster(
            contact_points=SCYLLADB_CONTACT_POINTS,
            port=SCYLLADB_PORT,
            auth_provider=self.auth_provider
        )
        self.session = self.cluster.connect(KEYSPACE_NAME)
        print(f"Recommender connected to ScyllaDB keyspace '{KEYSPACE_NAME}'")

        print(f"Loading Sentence Transformer model: {EMBEDDING_MODEL_NAME} for recommender...")
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME)
        print("Model loaded for recommender.")

    def _get_movie_embedding_by_title(self, movie_title: str) -> np.ndarray | None:
        """
        Retrieves the embedding of a movie by its title from ScyllaDB.
        """
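        # '?' bind markers are only valid in prepared statements (plain string queries use %s in this driver).
        query = self.session.prepare(f"SELECT embedding FROM {TABLE_NAME} WHERE title = ? ALLOW FILTERING;")
        row = self.session.execute(query, (movie_title,)).one()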
        if row:
            # ScyllaDB returns VECTOR as a list, convert back to numpy array for consistency
            return np.array(row.embedding)
        return None

    def _get_movie_details_by_id(self, movie_id: uuid.UUID) -> dict | None:
        """
        Retrieves full details of a movie by its ID.
        """
        # Prepare the statement so the '?' bind marker is accepted.
        query = self.session.prepare(f"SELECT title, description, genre, release_year FROM {TABLE_NAME} WHERE movie_id = ?;")
        row = self.session.execute(query, (movie_id,)).one()
        if row:
            return {
                "movie_id": movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            }
        return None

    def get_recommendations(self, movie_title: str, limit: int = 5) -> list[dict]:
        """
        Generates movie recommendations based on a given movie title.
        """
        print(f"\nSearching for recommendations for: '{movie_title}'")

        # 1. Get the embedding for the reference movie
        reference_embedding = self._get_movie_embedding_by_title(movie_title)
        if reference_embedding is None:
            print(f"Movie '{movie_title}' not found in the database.")
            return []

        # 2. Perform ANN search using ScyllaDB's VECTOR search
        # The `ANN OF` clause is the core of our vector similarity query.
        # We also filter out the movie itself from recommendations.
        ann_query = self.session.prepare(f"""
            SELECT movie_id, title, description, genre, release_year
            FROM {TABLE_NAME}
            ORDER BY embedding ANN OF ?
            LIMIT ?;
        """)
        # Bind the vector as a plain list; request limit + 1 rows because the reference movie matches itself.
        results = self.session.execute(ann_query, (reference_embedding.tolist(), limit + 1))

        recommendations = []
        for row in results:
            if row.title == movie_title: # Exclude the movie we searched for
                continue
            recommendations.append({
                "movie_id": row.movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            })
            if len(recommendations) >= limit:
                break
        return recommendations

    def get_recommendations_from_text(self, query_text: str, limit: int = 5) -> list[dict]:
        """
        Generates movie recommendations based on a free-form text query.
        """
        print(f"\nSearching for recommendations based on text query: '{query_text}'")

        # 1. Generate embedding for the free-form text query
        query_embedding = self.model.encode(query_text, convert_to_tensor=False)

        # 2. Perform ANN search directly with the query embedding
        ann_query = self.session.prepare(f"""
            SELECT movie_id, title, description, genre, release_year
            FROM {TABLE_NAME}
            ORDER BY embedding ANN OF ?
            LIMIT ?;
        """)
        # Bind the query embedding as a plain list of floats.
        results = self.session.execute(ann_query, (query_embedding.tolist(), limit))

        recommendations = []
        for row in results:
            recommendations.append({
                "movie_id": row.movie_id,
                "title": row.title,
                "description": row.description,
                "genre": row.genre,
                "release_year": row.release_year
            })
        return recommendations

    def shutdown(self):
        """Closes the ScyllaDB connection."""
        if self.session:
            self.session.shutdown()
        if self.cluster:
            self.cluster.shutdown()
        print("ScyllaDB connection closed.")

if __name__ == "__main__":
    recommender = MovieRecommender()

    # Example 1: Get recommendations based on an existing movie
    recommended_movies = recommender.get_recommendations("Inception", limit=3)
    print("\n--- Recommendations for 'Inception' ---")
    if recommended_movies:
        for movie in recommended_movies:
            print(f"- {movie['title']} ({movie['release_year']}) - Genre: {movie['genre']}")
            # print(f"  Description: {movie['description'][:100]}...")
    else:
        print("No recommendations found.")

    # Example 2: Get recommendations based on a free-form text query
    text_query = "movies about space travel and futuristic technologies"
    recommended_movies_from_text = recommender.get_recommendations_from_text(text_query, limit=3)
    print(f"\n--- Recommendations for text query: '{text_query}' ---")
    if recommended_movies_from_text:
        for movie in recommended_movies_from_text:
            print(f"- {movie['title']} ({movie['release_year']}) - Genre: {movie['genre']}")
    else:
        print("No recommendations found.")

    recommender.shutdown()

Explanation:

MovieRecommender Class: Encapsulates our recommendation logic, including ScyllaDB connection and the embedding model.
_get_movie_embedding_by_title: A helper function to fetch the embedding of a specific movie from the database. We use ALLOW FILTERING here because title is not a primary key, but in a real system, you’d likely query by movie_id (primary key) for performance.
get_recommendations(self, movie_title: str, limit: int = 5):
- It first retrieves the embedding of the movie_title provided by the user.
- ORDER BY embedding ANN OF ?: This is the heart of the vector search! ScyllaDB’s ANN OF clause takes a query vector (our reference_embedding) and returns results ordered by their similarity to it.
- LIMIT ?: Limits the number of recommendations returned.
- We add + 1 to the limit and then manually filter out the original movie to ensure we get limit unique recommendations.
get_recommendations_from_text(self, query_text: str, limit: int = 5): This demonstrates a powerful feature: you can also generate recommendations directly from a natural language query! The input text is first converted into an embedding, which is then used in the ANN OF query. This allows users to describe what they’re looking for.
if __name__ == "__main__": block: This section provides simple examples to test our recommender.
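
Matching on title to drop the reference movie works for our small dataset, but it would also drop a different film that happens to share the title. A variant that compares primary keys instead can be sketched as a pure function over the ranked rows. The function name drop_reference is hypothetical, the rows are plain dictionaries, and the integer IDs stand in for UUIDs.

```python
def drop_reference(rows, reference_id, limit):
    """Keep ANN results in ranked order, skip the reference movie, and stop after `limit` items."""
    kept = []
    for row in rows:
        if len(kept) >= limit:
            break
        if row["movie_id"] == reference_id:
            continue
        kept.append(row)
    return kept

# Ranked rows as an ANN query for limit + 1 = 4 results might return them (illustrative only).
ranked = [
    {"movie_id": 1, "title": "Inception"},
    {"movie_id": 2, "title": "The Matrix"},
    {"movie_id": 3, "title": "Interstellar"},
    {"movie_id": 4, "title": "Eternal Sunshine of the Spotless Mind"},
]
print([row["title"] for row in drop_reference(ranked, reference_id=1, limit=3)])
```

Using this in get_recommendations would require _get_movie_embedding_by_title to also return the movie_id of the reference movie, which the version in this chapter does not do.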

Run the recommender script:

python recommender.py

You should see output similar to this (results may vary slightly due to ANN approximations):

Recommender connected to ScyllaDB keyspace 'movie_recommendations'
Loading Sentence Transformer model: all-MiniLM-L6-v2 for recommender...
Model loaded for recommender.

Searching for recommendations for: 'Inception'

--- Recommendations for 'Inception' ---
- The Matrix (1999) - Genre: Sci-Fi
- Interstellar (2014) - Genre: Sci-Fi
- Eternal Sunshine of the Spotless Mind (2004) - Genre: Romance

Searching for recommendations based on text query: 'movies about space travel and futuristic technologies'

--- Recommendations for text query: 'movies about space travel and futuristic technologies' ---
- Interstellar (2014) - Genre: Sci-Fi
- The Matrix (1999) - Genre: Sci-Fi
- Inception (2010) - Genre: Sci-Fi
ScyllaDB connection closed.

Notice how “The Matrix” and “Interstellar” (both Sci-Fi) are recommended for “Inception,” and the text query also surfaces Sci-Fi movies. Pretty cool, right?

Mini-Challenge: Enhance the Search

Our recommender works, but what if a user wants to narrow the results with structured criteria? For example, imagine they want “Sci-Fi movies like Inception, but from the 90s.”

Challenge: Modify the get_recommendations function (or create a new one) to allow filtering recommendations by genre and/or release_year in addition to the vector similarity search. This will combine the power of traditional structured queries with vector search.

Hint: ScyllaDB’s ANN OF clause can be combined with WHERE clauses for additional filtering. For example:

SELECT ... FROM movies WHERE genre = 'Sci-Fi' ORDER BY embedding ANN OF ? LIMIT ?;

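Keep in mind that support for WHERE restrictions alongside ANN OF can vary by ScyllaDB release, and filtering on non-primary-key columns such as genre or release_year may require a secondary index or ALLOW FILTERING. If your version rejects the combined query, a workable fallback on a small dataset like ours is to request more rows than you need with a larger LIMIT and filter them in your application code.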

What to observe/learn:

How to combine semantic search (ANN) with exact filtering.
The syntax for combining WHERE and ORDER BY embedding ANN OF clauses.
The flexibility of hybrid search approaches.
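
If you want a client-side fallback to compare against your CQL solution, the filtering step can be sketched as a function applied to an over-fetched, already ranked result list. This is not the WHERE-based answer to the challenge; filter_candidates and its parameters are hypothetical, and the sample ranking is illustrative.

```python
def filter_candidates(candidates, genre=None, year_from=None, year_to=None, limit=5):
    """Apply optional genre and release-year filters to ranked rows, preserving their order."""
    selected = []
    for movie in candidates:
        if len(selected) >= limit:
            break
        if genre is not None and movie["genre"] != genre:
            continue
        if year_from is not None and movie["release_year"] < year_from:
            continue
        if year_to is not None and movie["release_year"] > year_to:
            continue
        selected.append(movie)
    return selected

# Pretend these rows came back from an ANN query with a generous LIMIT.
ranked = [
    {"title": "The Matrix", "genre": "Sci-Fi", "release_year": 1999},
    {"title": "Interstellar", "genre": "Sci-Fi", "release_year": 2014},
    {"title": "Pulp Fiction", "genre": "Crime", "release_year": 1994},
]
print(filter_candidates(ranked, genre="Sci-Fi", year_from=1990, year_to=1999))
```

The trade-off is that filtering after the search can return fewer than limit results when few of the nearest neighbors satisfy the filter, which is one reason database-side filtering is attractive at scale.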

Common Pitfalls & Troubleshooting

Even with a structured approach, you might encounter some bumps along the road. Here are a few common pitfalls and how to troubleshoot them:

Embedding Model Mismatch (Dimensions):
- Pitfall: Your EMBEDDING_DIMENSION in schema_setup.py doesn’t match the actual output dimension of your SentenceTransformer model. For example, if you set EMBEDDING_DIMENSION = 768 but use all-MiniLM-L6-v2 (which outputs 384), you’ll get errors during data ingestion or index creation.
- Troubleshooting: Always verify the output dimension of your chosen SentenceTransformer model. You can do this by encoding a dummy sentence:
```
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
dummy_embedding = model.encode("hello world")
print(len(dummy_embedding)) # Should print 384
```
  Ensure this value matches EMBEDDING_DIMENSION in your schema.
ScyllaDB Connection Issues:
- Pitfall: cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': ConnectionRefusedError(111, "Tried to connect to ['127.0.0.1:9042']. Last error: Connection refused")})
- Troubleshooting:
  - Is your ScyllaDB instance running? Check your Docker containers (docker ps) or system services.
  - Is the port correct (9042 is default)?
  - Are the SCYLLADB_CONTACT_POINTS, SCYLLADB_USERNAME, SCYLLADB_PASSWORD correct? Double-check environment variables or hardcoded values.
  - If running in Docker, ensure the port is exposed (-p 9042:9042).
ANN OF Query Syntax Errors or Index Not Found:
- Pitfall: You might get Invalid query: No vector index found for table movie_recommendations.movies on column embedding with dimension 384 or a generic syntax error for the ANN OF clause.
- Troubleshooting:
  - Did you run schema_setup.py? Ensure the custom index movies_embedding_idx was successfully created.
  - ScyllaDB Version: Vector search (VECTOR type, ANN OF clause, SAI with similarity_function) is a relatively new feature. Make sure you are running ScyllaDB 5.3.0 or a later version. Older versions will not support these features.
  - Index Options: Double-check the similarity_function and dim options in your CREATE CUSTOM INDEX statement. They must match the expected values.
  - Case Sensitivity: ScyllaDB keyspace and table names are typically lowercase. Ensure consistency.

Summary

Congratulations! You’ve successfully built a functional movie recommendation system powered by USearch and ScyllaDB’s integrated vector search.

Here are the key takeaways from this chapter:

Embeddings are fundamental: They transform complex data like movie descriptions into numerical vectors, enabling semantic comparisons.
Vector Search is crucial for relevance: USearch, integrated into ScyllaDB, provides the high-performance Approximate Nearest Neighbor (ANN) search needed for real-time recommendations.
ScyllaDB offers scalable vector storage: Its VECTOR data type and ANN OF query syntax simplify storing and querying embeddings at scale.
Hybrid search is powerful: You can combine vector similarity search with traditional database filtering (WHERE clauses) for even more precise recommendations.
Practical application: You’ve seen how these technologies come together to create intelligent, personalized user experiences.

What’s next? You could expand this project by:

Integrating with a real movie dataset (e.g., MovieLens).
Building a simple web UI for your recommender.
Experimenting with different embedding models and similarity metrics.
Exploring advanced ScyllaDB features for performance tuning.

Keep exploring, keep building, and remember the power of vector search in shaping the future of AI applications!
