Introduction: The Art of Measuring Closeness
Welcome to Chapter 8! In our journey with USearch and ScyllaDB, we’ve learned how to transform data into numerical vectors and store them for lightning-fast searches. But what exactly does “search for similar vectors” truly mean? How do we define “similarity” in a world of numbers?
The answer lies in vector distance metrics. Just like you might measure the distance between two cities on a map, we need a way to quantify how “far apart” or “close together” two vectors are in their multi-dimensional space. The choice of metric is paramount, as it directly impacts the relevance and accuracy of your search results. A “similar” item according to one metric might be quite different according to another!
In this chapter, you’ll learn:
- What vector distance metrics are and why they’re so crucial for effective vector search.
- The most common metrics: Euclidean Distance, Cosine Similarity, and Dot Product, including their mathematical intuition and practical applications.
- How to select the appropriate metric for your specific data and use case.
- How USearch allows you to easily specify these metrics, and how ScyllaDB leverages them for its integrated vector search.
This chapter builds on your understanding of vector embeddings and basic USearch operations. Get ready to refine your search capabilities by mastering the art of measuring closeness!
Core Concepts: Defining Similarity in Vector Space
At its heart, vector search is about finding vectors that are “close” to a query vector. But “close” isn’t a universal term. It depends entirely on the mathematical function we use to calculate the distance or similarity between two vectors. Let’s dive into the core concepts.
What are Vector Distance Metrics?
Imagine you have two friends, Alice and Bob, and you know their GPS coordinates. How would you measure “how close” they are?
- You could draw a straight line between their current positions and measure its length. This is like Euclidean Distance.
- You could consider the angle formed by drawing lines from a central point (like the Earth’s center) to each of them. This is akin to Cosine Similarity, focusing on direction.
In vector search, a distance metric is a function that takes two vectors as input and returns a single numerical value representing their dissimilarity. Generally, a smaller distance implies greater similarity. Conversely, a similarity metric returns a value where a larger value implies greater similarity. It’s important to keep this distinction in mind as we explore.
Why Do They Matter in Vector Search?
The choice of distance metric directly influences which results your vector search returns as “most similar.” Different metrics emphasize different aspects of your data:
- Do you care about the absolute difference in values across all dimensions (e.g., for physical properties or raw sensor readings)?
- Or do you care more about the overall direction or topic represented by the vector, regardless of its “strength” or magnitude (e.g., for text or image embeddings)?
Selecting the right metric ensures your search results are truly relevant to your application’s definition of similarity.
Common Vector Distance Metrics
Let’s explore the most widely used metrics:
1. Euclidean Distance (L2 Distance)
- What it is: Often called L2 distance, this is the most intuitive measure of distance between two points in space. It’s the length of the straight line segment connecting the two points. Think of it as the “as the crow flies” distance.
- Intuition: It’s calculated by taking the square root of the sum of the squared differences between corresponding components of the two vectors.
- When to use it:
- When the magnitude of the vector components is meaningful and contributes to similarity.
- For data where absolute differences are important, such as geographic coordinates, sensor readings, or features where a larger value genuinely means “more” of something.
- When vectors are not normalized and their length (magnitude) carries information.
- USearch Context: USearch typically offers `MetricKind.L2sq` (squared Euclidean distance). The square root operation is computationally expensive and doesn’t change the relative ranking of distances, so using the squared version is a common speed optimization. A smaller `L2sq` value indicates higher similarity.
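To make the intuition concrete, here is a small pure-NumPy sketch (independent of USearch); the helper names `l2` and `l2sq` are local conventions, not library functions:

```python
import numpy as np

def l2sq(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(d, d))

def l2(a, b):
    """Euclidean (L2) distance: the square root of l2sq."""
    return float(np.sqrt(l2sq(a, b)))

# Skipping the square root never changes the relative ranking:
print(l2sq([0.0, 0.0], [3.0, 4.0]))  # 25.0
print(l2([0.0, 0.0], [3.0, 4.0]))    # 5.0
```

Because the square root is monotonic, ranking by `l2sq` always agrees with ranking by `l2`, which is exactly why the squared form is the cheaper default.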
2. Cosine Similarity (Angular Distance)
- What it is: Cosine similarity measures the cosine of the angle between two vectors. It focuses purely on the direction of the vectors, ignoring their magnitude (length).
- Intuition: If two vectors point in exactly the same direction, the angle between them is 0 degrees, and its cosine is 1 (perfect similarity). If they point in opposite directions, the angle is 180 degrees, and its cosine is -1 (perfect dissimilarity). If they are orthogonal (perpendicular), the angle is 90 degrees, and its cosine is 0 (no similarity).
- When to use it:
- Extremely popular for text embeddings (like those from BERT or OpenAI models), image features, and recommendation systems.
- When you want to find items that are conceptually similar, regardless of how “strong” or “long” their vector representation is. For example, two short, well-written product reviews might be more similar in topic than a long, rambling one, even if the long one has a larger vector magnitude.
- Often used with normalized vectors (vectors with a length of 1).
- USearch Context: USearch uses `MetricKind.Cos`. It returns `1 - cosine_similarity` as a distance, so a smaller distance value (closer to 0) means a higher cosine similarity (closer to 1), indicating greater similarity.
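The three cases from the intuition above can be checked with a few lines of plain NumPy (no USearch involved); `cosine_similarity` here is a local helper, not a library call:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: direction only, magnitude ignored."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  -> same direction, despite different lengths
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  -> orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -> opposite directions
```

The corresponding distance, in the convention described above, is simply `1 - cosine_similarity(a, b)`.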
3. Dot Product (Inner Product)
- What it is: The dot product of two vectors is the sum of the products of their corresponding components. It measures how much one vector “goes in the direction” of another, taking into account both direction and magnitude.
- Intuition:
- If vectors are normalized (unit length), the dot product is exactly equal to the cosine similarity.
- If vectors are not normalized, a larger dot product means the vectors are more aligned and have larger magnitudes.
- When to use it:
- In recommendation systems where a user’s preference vector (magnitude reflecting strength of preference) combined with item vectors (direction reflecting item characteristics) is important.
- When both the direction and the magnitude of the vectors are meaningful for your definition of similarity.
- Be cautious: if vectors are unbounded and not normalized, a query vector might be “similar” to a very long, irrelevant vector just because of its magnitude.
- USearch Context: USearch uses `MetricKind.IP`. For this metric, USearch typically returns `-dot_product` as the distance, so a smaller (more negative) distance value implies a larger dot product, indicating higher similarity.
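A quick NumPy illustration of both properties discussed above: magnitude sensitivity, and the equivalence with cosine similarity on unit-length vectors.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

# Sum of products of corresponding components: 1*2 + 2*3
print(np.dot(a, b))      # 8.0

# Doubling b's magnitude doubles the score, even though its direction is unchanged
print(np.dot(a, 2 * b))  # 16.0

# On unit-length (normalized) vectors, the dot product equals cosine similarity
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a_hat, b_hat), cos))  # True
```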
Choosing the Right Metric: A Decision Flow
Selecting the optimal metric isn’t always straightforward, but this general flow can guide your decision:
- If the strength or scale of your vector components matters (e.g., a vector `[10, 20]` is “stronger” than `[1, 2]`), then magnitude is important.
- If those magnitudes are also bounded or normalized (e.g., all vectors have a maximum length or are scaled to unit length), Dot Product may be a good fit.
- If magnitudes are arbitrary and you care about the absolute difference across dimensions, Euclidean Distance is often the better choice.
- If only the direction or topic matters, and the length of the vector is irrelevant or even misleading (common for many learned embeddings), then Cosine Similarity is usually the best choice.
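The flow above can be sketched as a tiny helper. `suggest_metric` and its two flags are purely illustrative names, and the returned strings simply label the three metrics discussed in this chapter; treat it as a starting point, not a substitute for evaluating metrics on your own data:

```python
def suggest_metric(magnitude_matters: bool, vectors_normalized: bool) -> str:
    """Rough sketch of the decision flow; always validate on real data."""
    if not magnitude_matters:
        return "cosine"       # direction/topic only
    if vectors_normalized:
        return "dot_product"  # magnitude meaningful but bounded
    return "euclidean"        # absolute differences across dimensions

print(suggest_metric(magnitude_matters=False, vectors_normalized=False))  # cosine
print(suggest_metric(magnitude_matters=True, vectors_normalized=True))    # dot_product
print(suggest_metric(magnitude_matters=True, vectors_normalized=False))   # euclidean
```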
How USearch and ScyllaDB Use Them
USearch provides direct control over the distance metric. When you create an `Index` object, you explicitly pass a `MetricKind` enum. This tells USearch how to compute distances during indexing and searching, allowing it to apply metric-specific optimizations.
ScyllaDB’s Vector Search integrates these concepts seamlessly. While the underlying implementation may leverage USearch or similar high-performance libraries, ScyllaDB abstracts this complexity. When you perform a vector search using the `ANN OF` clause in CQL, you can specify the desired distance metric directly. This ensures that the search performed by ScyllaDB aligns with how your vectors were generated and how you define similarity for your application. It’s crucial that the metric your embedding model was designed for matches the metric you select in your ScyllaDB `ANN OF` query; otherwise results will be inconsistent.
Step-by-Step Implementation: USearch in Action with Different Metrics
Let’s get hands-on and see how different metrics affect search results using USearch. We’ll use a simple set of 3-dimensional vectors to represent conceptual items.
Prerequisites
- Python 3.8+
- The `usearch` library installed. You can install it via pip: `pip install usearch` (check the USearch GitHub repository for the latest stable version).
- The `numpy` library: `pip install numpy`.
Step 1: Prepare Your Environment and Sample Vectors
First, let’s import the necessary libraries and define some sample vectors. These vectors are purely illustrative; in a real application, they would come from an embedding model.
```python
import numpy as np
from usearch.index import Index, MetricKind, Metric

print(f"USearch version: {Index.version}")  # Check the installed USearch version

# Define some sample 3-dimensional vectors.
# These vectors are intentionally chosen to highlight differences:
# 'apple' and 'orange' are somewhat close in all dimensions,
# 'banana' is quite different, and
# 'fruit_bowl' is somewhere between 'apple'/'orange' and 'banana'.
vectors = {
    "apple": np.array([0.1, 0.9, 0.2], dtype=np.float32),
    "orange": np.array([0.2, 0.8, 0.3], dtype=np.float32),
    "banana": np.array([0.7, 0.1, 0.8], dtype=np.float32),
    "fruit_bowl": np.array([0.3, 0.7, 0.4], dtype=np.float32),
}
print("Sample vectors defined.")
```
Explanation:
- We import `numpy` for efficient array handling and `Index`, `MetricKind`, and `Metric` from `usearch.index`. Printing `Index.version` confirms the installed USearch version.
- Our `vectors` dictionary maps conceptual names to their 3D float32 representations.
Step 2: Create an Index with Euclidean Distance (L2)
Now, let’s create a USearch index configured to use squared Euclidean distance (L2sq).
```python
# Create an index for 3-dimensional vectors using Euclidean (L2) distance.
# MetricKind.L2sq is used for performance; sqrt doesn't change relative ranking.
index_l2 = Index(ndim=3, metric=MetricKind.L2sq)
print(f"\nL2 Index created with {index_l2.metric} metric.")
```
Explanation:
- `Index(ndim=3, metric=MetricKind.L2sq)` initializes our index. `ndim` specifies the dimensionality of our vectors, and `MetricKind.L2sq` tells USearch to rank results by squared Euclidean distance. Remember, a smaller `L2sq` value means higher similarity.
Step 3: Add Vectors to the L2 Index
Let’s populate our L2 index with the sample vectors.
```python
# Add vectors to the L2 index
for key, vec in vectors.items():
    # Using hash(key) as a unique integer label for each vector
    index_l2.add(label=hash(key), vector=vec)

print(f"Added {len(vectors)} vectors to L2 index.")
```
Explanation:
- We iterate through our `vectors` dictionary.
- `index_l2.add(label=hash(key), vector=vec)` inserts each vector. We use `hash(key)` to derive a unique integer label for each string key, since USearch requires integer labels.
Step 4: Perform a Search with L2 Distance
Now, let’s query our L2 index using the “apple” vector and see what comes up as most similar.
```python
query_vector_l2 = vectors["apple"]

# Search for the 2 most similar vectors to 'apple'
matches_l2 = index_l2.search(query_vector_l2, count=2)

print("\n--- Search results (L2 Distance) for 'apple' ---")
for label, distance in zip(matches_l2.labels, matches_l2.distances):
    # Find the original key from our dictionary for display
    original_key = next(key for key, val in vectors.items() if hash(key) == label)
    print(f"Vector: '{original_key}', L2 Squared Distance: {distance:.4f}")
```
Explanation:
- `query_vector_l2 = vectors["apple"]` sets our query.
- `index_l2.search(query_vector_l2, count=2)` retrieves the 2 closest vectors.
- The output shows each vector’s name and its squared Euclidean distance from “apple”. A smaller distance means more similar under this metric. Since “apple” itself is in the index, it comes back first with a distance of 0; ‘orange’ will likely be the next closest due to its similar component values.
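You can verify that expectation by recomputing the squared distances from ‘apple’ by hand with plain NumPy (values shown rounded to four decimals):

```python
import numpy as np

# The same sample vectors, minus the query itself
samples = {
    "orange":     np.array([0.2, 0.8, 0.3]),
    "fruit_bowl": np.array([0.3, 0.7, 0.4]),
    "banana":     np.array([0.7, 0.1, 0.8]),
}
apple = np.array([0.1, 0.9, 0.2])

for name, vec in samples.items():
    d = apple - vec
    print(f"{name}: L2sq = {np.dot(d, d):.4f}")
# orange: 0.0300, fruit_bowl: 0.1200, banana: 1.3600
```

The component-wise arithmetic confirms the ranking: ‘orange’ is closest, ‘fruit_bowl’ next, ‘banana’ far away.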
Step 5: Create an Index with Cosine Similarity
Next, let’s create a new index, this time configured for Cosine Similarity.
```python
# Create an index for 3-dimensional vectors using Cosine Similarity
index_cos = Index(ndim=3, metric=MetricKind.Cos)
print(f"\nCosine Index created with {index_cos.metric} metric.")
```
Explanation:
- We create another `Index` instance, this time with `metric=MetricKind.Cos`.
- Remember, for `MetricKind.Cos`, USearch returns `1 - cosine_similarity` as the distance, so a smaller distance (closer to 0) means a higher cosine similarity (closer to 1), indicating greater similarity.
Step 6: Add Vectors to the Cosine Index
Populate our Cosine index with the same vectors.
```python
# Add vectors to the Cosine index
for key, vec in vectors.items():
    index_cos.add(label=hash(key), vector=vec)

print(f"Added {len(vectors)} vectors to Cosine index.")
```
Explanation:
- The process of adding vectors is identical, as the vectors themselves haven’t changed, only the underlying distance calculation logic of the index.
Step 7: Perform a Search with Cosine Similarity
Let’s query the Cosine index with the “apple” vector and compare the results to the L2 search.
```python
query_vector_cos = vectors["apple"]

# Search for the 2 most similar vectors to 'apple' using Cosine Similarity
matches_cos = index_cos.search(query_vector_cos, count=2)

print("\n--- Search results (Cosine Similarity) for 'apple' ---")
for label, distance in zip(matches_cos.labels, matches_cos.distances):
    original_key = next(key for key, val in vectors.items() if hash(key) == label)
    # Convert USearch's distance (1 - similarity) back to actual cosine similarity
    similarity = 1 - distance
    print(f"Vector: '{original_key}', Cosine Similarity: {similarity:.4f}")
```
Explanation:
- We perform the search the same way as with the L2 index.
- The key difference is how we interpret `distance`. Since `MetricKind.Cos` returns `1 - cosine_similarity`, we convert it back with `similarity = 1 - distance` for a more intuitive reading (where 1 is perfect similarity).
- Observe how the ranking or the absolute similarity values may differ from the Euclidean search, even with these simple vectors. This highlights the impact of metric choice.
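As a cross-check, you can compute the cosine similarities to ‘apple’ directly with NumPy. For these particular vectors the ordering happens to match the L2 results, but the absolute values tell a different story:

```python
import numpy as np

samples = {
    "orange":     np.array([0.2, 0.8, 0.3]),
    "fruit_bowl": np.array([0.3, 0.7, 0.4]),
    "banana":     np.array([0.7, 0.1, 0.8]),
}
apple = np.array([0.1, 0.9, 0.2])

for name, vec in samples.items():
    sim = np.dot(apple, vec) / (np.linalg.norm(apple) * np.linalg.norm(vec))
    print(f"{name}: cosine similarity = {sim:.4f}")
# orange: 0.9831, fruit_bowl: 0.9276, banana: 0.3232
```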
Mini-Challenge: Experiment with Dot Product
It’s your turn to explore!
Challenge:
Create a new USearch index using `MetricKind.IP` (Dot Product). Add the same vectors from the example and perform a search for “banana”. Compare the results to the L2 and Cosine searches you’ve already performed.
Hint:
Remember that for `MetricKind.IP`, USearch typically returns `-dot_product` as the “distance.” This means a smaller (more negative) distance value actually indicates a larger dot product, and thus higher similarity. You might want to display the dot product as `-distance` for clarity.
What to observe/learn:
- How the ranking of “most similar” vectors to “banana” changes when using Dot Product compared to Euclidean or Cosine.
- Pay close attention to the raw distance values returned by USearch and how you need to interpret them correctly for the Dot Product metric to understand similarity.
- Think about why ‘banana’ might have different nearest neighbors under Dot Product.
Take your time, try it out, and observe the fascinating differences!
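Once you have attempted the challenge, you can sanity-check your results against raw dot products computed with plain NumPy (keep in mind USearch would report these negated, as distances):

```python
import numpy as np

others = {
    "apple":      np.array([0.1, 0.9, 0.2]),
    "orange":     np.array([0.2, 0.8, 0.3]),
    "fruit_bowl": np.array([0.3, 0.7, 0.4]),
}
banana = np.array([0.7, 0.1, 0.8])

for name, vec in others.items():
    print(f"{name}: dot product with banana = {np.dot(banana, vec):.2f}")
# apple: 0.32, orange: 0.46, fruit_bowl: 0.60
```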
Common Pitfalls & Troubleshooting
Even with a solid understanding of metrics, pitfalls can emerge. Here are a few common ones:
Choosing the Wrong Metric for Your Data:
- Pitfall: Using Euclidean distance when vector magnitudes are arbitrary (e.g., in text embeddings, where a longer text might just have a larger magnitude vector but not be more “similar” in topic), leading to irrelevant search results. Or, conversely, using Cosine when magnitude is crucial (e.g., for user preference strength).
- Troubleshooting: Always start by understanding your data and how your embedding model works.
- Ask: Does the length (magnitude) of my vectors carry meaningful information?
- Ask: Are my vectors normalized? If so, Cosine or Dot Product are often good.
- Ask: Am I looking for conceptual similarity (topic, theme) or absolute value similarity (exact feature match)?
- Best Practice: Experiment with different metrics on a small, representative dataset and evaluate the relevance of the top results.
Misunderstanding Normalized vs. Unnormalized Vectors:
- Pitfall: Using Cosine Similarity with unnormalized vectors, or Dot Product when vectors should be normalized but aren’t. This can lead to the magnitude unfairly influencing results.
- Troubleshooting:
- If your embedding model outputs normalized vectors (unit length), Cosine Similarity and Dot Product behave very similarly (the dot product equals the cosine similarity).
- If your vectors are not normalized and you want to ignore magnitude, explicitly normalize them to unit length before indexing (`vector / np.linalg.norm(vector)` in NumPy) and then use Cosine Similarity.
- If magnitudes are important, ensure your chosen metric (such as L2 or raw Dot Product) handles them appropriately.
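A minimal demonstration of why normalization matters; `normalize` is a local helper, not a library function:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length; do this before indexing if magnitude
    should be ignored."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

long_vec = np.array([10.0, 0.0])
short_vec = np.array([1.0, 0.0])

# The raw dot product is dominated by magnitude...
print(np.dot(long_vec, short_vec))                        # 10.0
# ...while on normalized vectors it reduces to cosine similarity
print(np.dot(normalize(long_vec), normalize(short_vec)))  # 1.0
```

Here the two vectors point in exactly the same direction, so any score other than perfect similarity is pure magnitude noise.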
Misinterpreting USearch’s “Distance” Output:
- Pitfall: Assuming a smaller numerical `distance` value returned by USearch always means “more similar” in the intuitive sense, without considering each metric’s specific interpretation.
- Troubleshooting:
- `MetricKind.L2sq`: Smaller `distance` means more similar. (Distance is squared Euclidean.)
- `MetricKind.Cos`: Smaller `distance` (closer to 0) means higher similarity. (Distance is `1 - cosine_similarity`.)
- `MetricKind.IP`: Smaller (more negative) `distance` means higher similarity. (Distance is typically `-dot_product`.)
- Best Practice: Always refer to the official USearch documentation or the `MetricKind` enum definitions for the precise interpretation of the `distance` value for each metric.
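If you find yourself juggling these conventions, a small adapter can centralize the interpretation in one place. `to_similarity` and its string metric labels are hypothetical helpers for illustration, not part of the USearch API:

```python
def to_similarity(metric: str, distance: float) -> float:
    """Convert a returned distance into a 'bigger is more similar' score,
    following the interpretations listed above. Sketch only; verify against
    the USearch docs for your installed version."""
    if metric == "l2sq":
        return -distance       # smaller squared distance -> higher score
    if metric == "cos":
        return 1.0 - distance  # distance = 1 - cosine_similarity
    if metric == "ip":
        return -distance       # distance = -dot_product
    raise ValueError(f"unknown metric: {metric!r}")

print(to_similarity("cos", 0.1))  # 0.9
print(to_similarity("ip", -0.8))  # 0.8
```

With a helper like this, ranking code can sort by one score regardless of which metric the index was built with.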
Summary
Fantastic work! You’ve navigated the crucial world of vector distance metrics, which are the unsung heroes behind accurate and relevant vector searches.
Here are the key takeaways from this chapter:
- Vector distance metrics define how “similarity” is calculated between vectors, directly impacting search results.
- Euclidean Distance (L2) measures the straight-line distance, sensitive to absolute differences and vector magnitudes. USearch uses `L2sq` for performance.
- Cosine Similarity focuses on the angle between vectors, ideal for conceptual similarity where direction matters more than magnitude (common for text embeddings). USearch returns `1 - cosine_similarity`.
- Dot Product (IP) considers both magnitude and direction. For normalized vectors, it is equivalent to Cosine Similarity. USearch typically returns `-dot_product`.
- Choosing the right metric is critical and depends on your data’s characteristics and your application’s definition of similarity.
- USearch provides explicit `MetricKind` options for easy configuration.
- ScyllaDB’s Vector Search integrates these metrics, allowing you to specify them directly in your `ANN OF` CQL queries for powerful, distributed similarity search.
- Common pitfalls include selecting the wrong metric, mishandling normalized/unnormalized vectors, and misinterpreting USearch’s distance outputs.
You now have a deeper understanding of how “similarity” is quantified in vector space, empowering you to make informed decisions for your USearch and ScyllaDB implementations.
What’s Next?
In the next chapter, we’ll shift our focus to advanced indexing strategies within USearch. We’ll explore techniques that go beyond basic indexing to optimize performance, handle massive datasets, and fine-tune the trade-off between search speed and accuracy. Get ready to scale your vector search capabilities!
References
- USearch GitHub Repository
- ScyllaDB Vector Search Overview
- ScyllaDB Working with Vector Search Documentation
- Wikipedia: Euclidean distance
- Wikipedia: Cosine similarity
- Wikipedia: Dot product
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.