Chapter 15: Fraud Detection with Vector Similarity

Introduction: Detecting the Undetectable with Vectors

Welcome to Chapter 15! So far, we’ve explored the fundamentals of vector search with USearch and its powerful integration with ScyllaDB for scalable data storage. Now, we’re going to apply this knowledge to a critical real-world problem: fraud detection.

Imagine a world where every transaction, every login attempt, every user action leaves a unique data signature. Fraudulent activities often deviate from normal patterns, but these deviations can be subtle and hard to catch with traditional rule-based systems. This is where vector similarity shines! By representing transactions as high-dimensional vectors (embeddings), we can use USearch to quickly find “neighbors” – or, in this case, “non-neighbors” – that indicate suspicious behavior. ScyllaDB provides the robust, low-latency storage needed to manage billions of these transaction vectors.

In this chapter, you’ll learn how to conceptualize transaction data as vectors, store them efficiently in ScyllaDB, and leverage USearch to perform lightning-fast similarity searches to identify potential fraud. We’ll set up a basic Python environment, define a ScyllaDB schema, simulate transaction data, and then use USearch to flag anomalous transactions. Get ready to put your vector search skills to the test in a high-stakes scenario!

Prerequisites

Before we dive in, make sure you’re comfortable with:

Vector Embeddings: Understanding how various data types can be converted into numerical vectors (Chapter 3).
USearch Fundamentals: Creating, adding to, and querying a USearch index (Chapters 4-6).
ScyllaDB Integration: Connecting Python to ScyllaDB and basic CRUD operations (Chapters 10-12).
Basic Python Programming: We’ll be using Python for our application logic.

Core Concepts: Vectorizing Transactions for Fraud Detection

Fraud detection is essentially an anomaly detection problem. We’re looking for transactions that don’t fit the expected pattern. Vector similarity provides a powerful framework for this.

What is Transaction Fraud?

Transaction fraud encompasses various malicious activities, such as unauthorized credit card usage, identity theft, fraudulent insurance claims, or fake purchases. The common thread is that these actions are usually outliers compared to legitimate behavior.

The Role of Embeddings in Fraud Detection

How do we turn a transaction, which might involve an amount, merchant, location, time, and items, into a vector? This is where transaction embeddings come into play.

Feature Engineering: We extract relevant features from a transaction. For example:
- Numerical: Transaction amount, time of day, number of items.
- Categorical: Merchant ID, transaction type (online, in-store), country, payment method.
- Textual: Item descriptions (which might be embedded using a language model).
Vectorization: Each of these features, or combinations thereof, is converted into a numerical representation.
- Numerical features can be scaled and normalized.
- Categorical features can be one-hot encoded or mapped to learned embeddings.
- The output is a single, fixed-size vector for each transaction.
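As a rough illustration of the feature engineering and vectorization steps above, the following sketch turns one transaction into a small fixed-size vector using only numpy. The field names, the vocabularies, and the scaling constant (max_amount) are assumptions made up for this example; a real system would derive them from historical data or replace the whole function with a learned embedding model.

```python
import numpy as np

# Hypothetical category vocabularies; a real system would derive these from data.
TX_TYPES = ["online", "in_store"]
PAYMENT_METHODS = ["card", "wallet", "bank_transfer"]

def one_hot(value, vocabulary):
    """Return a one-hot vector for value; unknown values map to all zeros."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    if value in vocabulary:
        vec[vocabulary.index(value)] = 1.0
    return vec

def vectorize_transaction(tx, max_amount=10_000.0):
    """Concatenate scaled numerical features and one-hot categorical features."""
    numerical = np.array([
        min(tx["amount"], max_amount) / max_amount,  # amount clipped, then scaled to [0, 1]
        tx["hour_of_day"] / 23.0,                    # hour of day scaled to [0, 1]
    ], dtype=np.float32)
    return np.concatenate([
        numerical,
        one_hot(tx["tx_type"], TX_TYPES),
        one_hot(tx["payment_method"], PAYMENT_METHODS),
    ])

example = {"amount": 250.0, "hour_of_day": 23, "tx_type": "online", "payment_method": "wallet"}
vector = vectorize_transaction(example)
print(vector.shape, vector)
```

Every transaction produces a vector of the same length (2 numerical slots plus one slot per category value), which is the property a vector index needs.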

For this chapter, we’ll simplify and assume we have a process that generates these transaction vectors. In a real-world scenario, this might involve a complex machine learning model (like a neural network trained on historical transaction data) that outputs embeddings.

Similarity as Anomaly Detection

Once transactions are vectors, we can use vector similarity to identify anomalies:

Outlier Detection: A new transaction vector that is “far away” (low similarity, high distance) from all previously seen legitimate transactions might be fraudulent. It doesn’t resemble anything normal.
Clustering: Fraudulent transactions might form small, distinct clusters that are far from the large cluster of legitimate transactions.
Behavioral Deviation: We can track a user’s typical transaction vector profile. If a new transaction from that user is significantly different from their usual pattern, it’s suspicious.

USearch excels at finding these similar (or dissimilar) vectors at scale, making it ideal for real-time fraud detection pipelines.
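To make the outlier-detection idea concrete before bringing in USearch, here is a minimal brute-force sketch with numpy. It scores a query by its mean Euclidean distance to the k nearest known vectors. The function name knn_average_distance and the synthetic 8-dimensional data are assumptions for illustration only; an ANN index replaces the exhaustive scan at scale.

```python
import numpy as np

def knn_average_distance(query, known_vectors, k=5):
    """Mean Euclidean distance from query to its k nearest known vectors."""
    distances = np.linalg.norm(known_vectors - query, axis=1)
    nearest = np.sort(distances)[:k]
    return float(nearest.mean())

rng = np.random.default_rng(seed=0)
legit = rng.normal(loc=0.0, scale=1.0, size=(500, 8))  # synthetic "normal" behavior

typical = np.zeros(8)      # sits in the middle of the legitimate cluster
unusual = np.full(8, 6.0)  # far from anything seen before

score_typical = knn_average_distance(typical, legit)
score_unusual = knn_average_distance(unusual, legit)
print(f"typical: {score_typical:.2f}, unusual: {score_unusual:.2f}")
```

The same score works for behavioral deviation if known_vectors holds only one user's history instead of the global population.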

USearch and ScyllaDB in Action

Here’s how USearch and ScyllaDB work together in this context:

ScyllaDB: Stores the raw transaction data and the generated transaction vectors persistently. Its low-latency, high-throughput nature is crucial for ingesting massive amounts of real-time transaction data and serving them for analysis. ScyllaDB’s native Vector Search capabilities, generally available as of January 2026, allow for efficient storage and querying of vector data directly within the database.
USearch: Provides an incredibly fast, in-memory approximate nearest neighbor (ANN) index. For real-time fraud detection, we can load a subset of recent transaction vectors (or a specific user’s transaction history) into USearch to perform rapid similarity lookups against incoming transactions. This allows us to flag suspicious activities almost instantly.

Let’s visualize this data flow:

flowchart TD
    A[Raw Transaction Data] --> B[Embedding Model]
    B --> C[Transaction Vector]
    C --> D_ScyllaDB[ScyllaDB]
    C --> E_USearch[USearch]
    subgraph RealTimeFraudDetection
        F[New Transaction] --> G[Embedding Model]
        G --> H[Transaction Vector]
        H --> I[USearch Similarity Search]
        I --> J{Outlier?}
        J -->|Yes| K[Flag Review / Block]
        J -->|No| L[Approve]
    end
    D_ScyllaDB --> E_USearch

A: Raw Transaction Data: This is the initial input—details like amount, merchant, timestamp, user ID.
B: Embedding Model - Feature Engineering: A process (often a machine learning model) that extracts meaningful numerical features and transforms them into a fixed-size vector.
C: Transaction Vector: The numerical representation of a transaction.
D: ScyllaDB - Store Vectors & Metadata: ScyllaDB stores the persistent record of all transactions, including their vectors and other relevant metadata.
E: USearch - In-Memory Index: For ultra-fast lookups, a subset of the most recent or relevant vectors from ScyllaDB can be loaded into a USearch index in application memory.
F: Incoming New Transaction: A real-time transaction that needs to be checked for fraud.
G: Embedding Model - Feature Engineering: The same model generates a vector for the new transaction.
H: New Transaction Vector: The vector representation of the incoming transaction.
I: USearch - Similarity Search: The new vector is queried against the USearch index to find its nearest neighbors.
J: Is New Vector an Outlier?: By analyzing the distances to its nearest neighbors, we determine if the transaction is anomalous. A large average distance to neighbors, or falling outside a predefined cluster, indicates an outlier.
K: Flag for Review / Block Transaction: If suspicious, the transaction is flagged or blocked.
L: Approve Transaction: If deemed normal, the transaction proceeds.
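Step E keeps only a recent or relevant subset of vectors in memory. One way to decide which entries still belong in that subset is a sliding time window, sketched below with the standard library only. The class name RecentWindow and the 24-hour window are assumptions for this example, and the sketch assumes entries are added in timestamp order; the caller would remove each expired key from the in-memory index as well.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

class RecentWindow:
    """Track which transaction keys are still inside a sliding time window."""

    def __init__(self, max_age):
        self.max_age = max_age
        self._entries = deque()  # (timestamp, key) pairs, oldest first

    def add(self, timestamp, key):
        self._entries.append((timestamp, key))

    def evict_expired(self, now):
        """Pop and return the keys whose age exceeds max_age."""
        expired = []
        while self._entries and now - self._entries[0][0] > self.max_age:
            expired.append(self._entries.popleft()[1])
        return expired

    def __len__(self):
        return len(self._entries)

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
window = RecentWindow(max_age=timedelta(hours=24))
window.add(now - timedelta(hours=30), key=1)  # already older than the window
window.add(now - timedelta(hours=2), key=2)   # still recent
expired_keys = window.evict_expired(now)
print(expired_keys, len(window))
```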

Step-by-Step Implementation: Building a Simple Fraud Detector

Let’s build a simplified Python application that demonstrates this flow. We’ll generate synthetic transaction vectors, store them in ScyllaDB, and then use USearch to detect anomalies.

Step 1: Set up Your Environment

First, ensure you have the necessary libraries installed. We’ll need scylla-driver for ScyllaDB, usearch for vector search, and numpy for vector operations.

pip install scylla-driver numpy usearch

Note on USearch Version: As of 2026-02-17, the usearch library (from unum-cloud) is continuously evolving. Always check PyPI for the latest stable version. The plain usearch package is sufficient for everything in this chapter. If you need reproducible builds, pin an exact version, for example pip install usearch==3.x.x.
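Since the exact versions matter for reproducibility, a quick way to see what is actually installed is the standard library's importlib.metadata. This is a small convenience sketch; the helper name installed_version is made up for this example.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ("scylla-driver", "numpy", "usearch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```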

Step 2: Define ScyllaDB Schema

We’ll create a keyspace and a table to store our transaction data. The transaction_vector column will use ScyllaDB’s native VECTOR<FLOAT, 128> type, a fixed-length vector of 128 floats, for efficient storage.

Connect to your ScyllaDB instance (e.g., via cqlsh or a client) and execute the following:

CREATE KEYSPACE IF NOT EXISTS fraud_detection WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE fraud_detection;

CREATE TABLE IF NOT EXISTS transactions (
    transaction_id UUID PRIMARY KEY,
    user_id UUID,
    amount DECIMAL,
    timestamp TIMESTAMP,
    transaction_vector VECTOR<FLOAT, 128>, -- 128-dimensional float vector
    is_fraudulent BOOLEAN -- For potential labeling/training
);

CREATE CUSTOM INDEX IF NOT EXISTS transaction_vector_idx ON transactions (transaction_vector) USING 'vector_index';

Explanation:

fraud_detection: Our keyspace to isolate our data.
transactions: Our table to hold transaction records.
transaction_id: A unique identifier for each transaction.
user_id: Identifies the user involved.
amount, timestamp, is_fraudulent: Standard transaction metadata.
transaction_vector VECTOR<FLOAT, 128>: This is the crucial part! We’re defining a column to store 128-dimensional floating-point vectors. ScyllaDB’s native vector type is optimized for this.
CREATE CUSTOM INDEX ... USING 'vector_index';: This creates a vector index on our vector column, which ScyllaDB uses for its integrated vector search capabilities. ScyllaDB declares vector indexes with the 'vector_index' class; the 'org.apache.cassandra.index.sai.StorageAttachedIndex' class belongs to Apache Cassandra’s Storage-Attached Indexing, which ScyllaDB does not provide. Check the ScyllaDB Vector Search documentation for the index options your version supports. The index is important for future direct vector queries within ScyllaDB.
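For a sense of what a direct vector query might look like, the sketch below only builds the CQL text. It assumes ScyllaDB accepts the ORDER BY ... ANN OF ... LIMIT form for vector-indexed columns; verify this against the documentation for your version. The helper name build_ann_query is hypothetical, and the 3-element vector is shortened for readability (the table above expects 128 dimensions).

```python
def build_ann_query(query_vector, limit=5, table="transactions", column="transaction_vector"):
    """Format a CQL ANN query string with the vector inlined as a literal."""
    literal = "[" + ", ".join(f"{float(x):.6f}" for x in query_vector) + "]"
    return (
        f"SELECT transaction_id, user_id, amount FROM {table} "
        f"ORDER BY {column} ANN OF {literal} LIMIT {int(limit)}"
    )

cql = build_ann_query([0.1, 0.2, 0.3], limit=3)
print(cql)
# With a live driver session, the string could then be run via session.execute(cql).
```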

Step 3: Generate Synthetic Transaction Vectors and Ingest into ScyllaDB

In a real system, transaction vectors would come from an embedding model. Here, we’ll use numpy to generate some random, yet distinct, vectors. We’ll create “normal” transactions and a few “fraudulent” ones that are intentionally different.

Create a new Python file, fraud_detector.py:

# fraud_detector.py
import uuid
import datetime
import random
from decimal import Decimal

import numpy as np
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider  # only used if authentication is enabled

# --- Configuration ---
SCYLLA_CONTACT_POINTS = ['127.0.0.1'] # Replace with your ScyllaDB IP(s)
SCYLLA_KEYSPACE = 'fraud_detection'
VECTOR_DIMENSIONS = 128
NUM_NORMAL_TRANSACTIONS = 1000
NUM_FRAUD_TRANSACTIONS = 10

# --- ScyllaDB Connection ---
def get_scylladb_session():
    # If ScyllaDB requires authentication:
    # auth_provider = PlainTextAuthProvider(username='your_username', password='your_password')
    # cluster = Cluster(SCYLLA_CONTACT_POINTS, auth_provider=auth_provider)
    cluster = Cluster(SCYLLA_CONTACT_POINTS)
    session = cluster.connect(SCYLLA_KEYSPACE)
    return session, cluster

# --- Vector Generation (Simplified) ---
def generate_transaction_vector(is_fraud=False):
    if is_fraud:
        # Fraudulent transactions are distinct, perhaps clustered around different centroids
        # Here, we make them "shifted" from normal transactions: each component falls in [10, 15)
        return np.random.rand(VECTOR_DIMENSIONS).astype(np.float32) * 5 + 10
    else:
        # Normal transactions are clustered in a lower range: each component falls in [1, 3)
        return np.random.rand(VECTOR_DIMENSIONS).astype(np.float32) * 2 + 1

def generate_transaction_data(num_transactions, is_fraud=False):
    transactions = []
    for _ in range(num_transactions):
        tx_id = uuid.uuid4()
        user_id = uuid.uuid4() # Simulate different users
        # Build the amount from a string so the DECIMAL column stores exactly two decimal places
        amount = Decimal(str(round(random.uniform(10.0, 1000.0), 2)))
        # Timezone-aware UTC timestamp, shifted up to 30 days into the past
        timestamp = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(
            days=random.randint(0, 30),
            hours=random.randint(0, 23),
        )
        vector = generate_transaction_vector(is_fraud)
        transactions.append({
            'transaction_id': tx_id,
            'user_id': user_id,
            'amount': amount,
            'timestamp': timestamp,
            'transaction_vector': vector.tolist(), # Convert numpy array to list for ScyllaDB
            'is_fraudulent': is_fraud
        })
    return transactions

# --- Ingestion into ScyllaDB ---
def ingest_transactions(session, transactions_data):
    insert_stmt = session.prepare(
        "INSERT INTO transactions (transaction_id, user_id, amount, timestamp, transaction_vector, is_fraudulent) "
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    for data in transactions_data:
        session.execute(insert_stmt, (
            data['transaction_id'],
            data['user_id'],
            data['amount'],
            data['timestamp'],
            data['transaction_vector'],
            data['is_fraudulent']
        ))
    print(f"Ingested {len(transactions_data)} transactions into ScyllaDB.")

if __name__ == "__main__":
    session, cluster = get_scylladb_session()

    print(f"Generating {NUM_NORMAL_TRANSACTIONS} normal transactions...")
    normal_tx = generate_transaction_data(NUM_NORMAL_TRANSACTIONS, is_fraud=False)
    ingest_transactions(session, normal_tx)

    print(f"Generating {NUM_FRAUD_TRANSACTIONS} fraudulent transactions...")
    fraud_tx = generate_transaction_data(NUM_FRAUD_TRANSACTIONS, is_fraud=True)
    ingest_transactions(session, fraud_tx)

    cluster.shutdown()
    print("ScyllaDB ingestion complete.")

Explanation:

Configuration: Defines ScyllaDB connection details and parameters for synthetic data generation.
get_scylladb_session(): Establishes a connection to your ScyllaDB cluster. Remember to update SCYLLA_CONTACT_POINTS and potentially add auth_provider if your ScyllaDB requires credentials.
generate_transaction_vector(): This function is a placeholder for your actual embedding model.
- For “normal” transactions, we generate random vectors in a lower numerical range.
- For “fraudulent” transactions, we shift the random range to make them distinctly different. This creates a clear separation for our detection demo.
generate_transaction_data(): Creates a list of dictionaries, each representing a transaction with a unique ID, user ID, amount, timestamp, and the generated vector.
ingest_transactions(): Prepares an INSERT statement once and executes it for every generated transaction in the ScyllaDB transactions table. The vectors arrive here already converted from numpy arrays to Python lists (vector.tolist() in generate_transaction_data()), as VECTOR<FLOAT, 128> expects a list of floats.
Main Block: Calls the functions to generate and ingest both normal and fraudulent transactions.
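Because the VECTOR<FLOAT, 128> column has a fixed length, it can help to validate each vector before handing it to the driver, so that a bad embedding fails early with a clear message. The following is a minimal sketch using only the standard library; the helper name to_cql_vector is an assumption for this example and is not part of the driver.

```python
import math

def to_cql_vector(values, expected_dim=128):
    """Convert an iterable of numbers (list or numpy array) into a plain list of
    Python floats, checking the length and rejecting NaN or infinite values."""
    floats = [float(x) for x in values]
    if len(floats) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(floats)}")
    if not all(math.isfinite(x) for x in floats):
        raise ValueError("vector contains NaN or infinite values")
    return floats

ok = to_cql_vector([0.5] * 128)
print(len(ok))

try:
    to_cql_vector([0.5] * 64)
except ValueError as exc:
    print(f"rejected: {exc}")
```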

Run this script to populate your ScyllaDB:

python fraud_detector.py

Step 4: Load Data from ScyllaDB and Build a USearch Index

Now that our vectors are in ScyllaDB, we’ll fetch them and build an in-memory USearch index. For a real-time system, you’d likely load a window of recent data or user-specific data.

Create a new file, realtime_detector.py, next to fraud_detector.py. It imports the configuration and helper functions from Step 3, so nothing is ingested a second time:

# realtime_detector.py
import numpy as np
from usearch.index import Index

# Reuse the Step 3 helpers. Importing fraud_detector does not trigger ingestion,
# because that code only runs under its own __main__ guard.
from fraud_detector import (
    VECTOR_DIMENSIONS,
    generate_transaction_vector,
    get_scylladb_session,
)

# Squared-L2 cutoff chosen for this chapter's synthetic data; tune it for real embeddings.
DISTANCE_THRESHOLD = 200.0

# --- Fetching from ScyllaDB ---
def fetch_legitimate_vectors(session):
    print("Fetching legitimate transaction vectors from ScyllaDB...")
    rows = session.execute(
        "SELECT transaction_id, transaction_vector, is_fraudulent FROM transactions"
    )
    # Keep only transactions labeled legitimate, so that "far from every neighbor"
    # means "unlike known-good behavior". The filter runs client-side because
    # is_fraudulent is not part of the primary key.
    vectors_map = {
        row.transaction_id: np.array(row.transaction_vector, dtype=np.float32)
        for row in rows
        if not row.is_fraudulent
    }
    print(f"Fetched {len(vectors_map)} vectors.")
    return vectors_map

# --- USearch Index Creation and Search ---
def build_usearch_index(vectors_map):
    print("Building USearch index...")
    index = Index(
        ndim=VECTOR_DIMENSIONS,
        metric='l2sq',  # L2 squared distance is often a good choice for dense embeddings
        dtype='f32'     # store vectors as 32-bit floats
    )
    # USearch expects integer keys, so this demo uses a sequential counter.
    # A more robust solution would keep an integer key -> UUID mapping
    # to look up the original transaction behind each neighbor.
    for key, vector in enumerate(vectors_map.values()):
        index.add(key, vector)
    print(f"USearch index built with {len(index)} vectors.")
    return index

def find_anomalies(index, new_transaction_vector, num_neighbors=5, distance_threshold=DISTANCE_THRESHOLD):
    print("\nSearching for anomalies for a new transaction vector...")
    # search() returns a Matches object. For a single query vector, .keys and
    # .distances are 1-D arrays ordered from nearest to farthest.
    results = index.search(new_transaction_vector, count=num_neighbors)
    neighbor_keys = results.keys  # integer keys assigned in build_usearch_index()
    neighbor_distances = results.distances

    print(f"Nearest {len(neighbor_keys)} neighbors found with distances: {neighbor_distances}")

    avg_distance = float(np.mean(neighbor_distances))
    print(f"Average distance to nearest neighbors: {avg_distance:.2f}")

    if avg_distance > distance_threshold:
        print(f"ALERT: New transaction is likely fraudulent! Average distance ({avg_distance:.2f}) "
              f"exceeds threshold ({distance_threshold:.2f}).")
        return True
    else:
        print(f"New transaction appears normal. Average distance ({avg_distance:.2f}) "
              f"is within threshold ({distance_threshold:.2f}).")
        return False

if __name__ == "__main__":
    session, cluster = get_scylladb_session()

    # --- Fetch and Index ---
    legit_vectors_map = fetch_legitimate_vectors(session)
    index = build_usearch_index(legit_vectors_map)

    # --- Simulate a new transaction for detection ---
    print("\n--- Simulating New Transactions for Fraud Detection ---")

    # 1. Simulate a normal new transaction
    new_normal_vector = generate_transaction_vector(is_fraud=False)
    print("Simulating a new NORMAL transaction:")
    find_anomalies(index, new_normal_vector)

    # 2. Simulate a fraudulent new transaction
    new_fraud_vector = generate_transaction_vector(is_fraud=True)
    print("\nSimulating a new POTENTIALLY FRAUDULENT transaction:")
    find_anomalies(index, new_fraud_vector)

    cluster.shutdown()
    print("\nScyllaDB connection closed. USearch demo complete.")

Explanation:

fetch_legitimate_vectors(): Queries ScyllaDB for every transaction_id, its transaction_vector, and its is_fraudulent label, and keeps the rows labeled legitimate in a dictionary mapping UUIDs to numpy arrays. Indexing only legitimate transactions is what lets a large neighbor distance mean “unlike normal behavior”.
build_usearch_index():
- Initializes Index (imported from usearch.index) with the VECTOR_DIMENSIONS and metric='l2sq' (L2 squared distance). L2 squared is often chosen for performance with dense embeddings and is equivalent to Euclidean distance for ranking. Because the reported distances are squared, any threshold must be expressed in squared units as well.
- Iterates through the fetched vectors and adds them to the USearch index. We use a simple incremental integer as the key for USearch. In a production system, you’d maintain a mapping from USearch’s integer key back to your transaction_id UUIDs to retrieve full transaction details.
find_anomalies():
- Takes a new_transaction_vector and queries the USearch index for its num_neighbors closest vectors. The result object exposes the neighbors’ integer keys (results.keys) and their distances (results.distances).
- It then calculates the average distance to these neighbors.
- A distance_threshold is used to determine if the transaction is an outlier. If the average distance is above this threshold, it suggests the new transaction is significantly different from known patterns, hence potentially fraudulent.
Main Block (Fraud Detection):
- Connects to ScyllaDB.
- Fetches the existing legitimate transaction vectors.
- Builds the USearch index.
- Simulates two new transactions: one “normal” and one “fraudulent” (using the same generate_transaction_vector logic as before).
- Calls find_anomalies() for each simulated transaction to demonstrate detection.

Run this script (after ensuring your ScyllaDB is populated from Step 3):

python realtime_detector.py

You should see output of the following shape, with the “fraudulent” transaction vector flagged due to its much higher average distance to neighbors:

...
Simulating a new NORMAL transaction:
Searching for anomalies for a new transaction vector...
Nearest 5 neighbors found with distances: [...]
Average distance to nearest neighbors: ...
New transaction appears normal. Average distance (...) is within threshold (200.00).

Simulating a new POTENTIALLY FRAUDULENT transaction:
Searching for anomalies for a new transaction vector...
Nearest 5 neighbors found with distances: [...]
Average distance to nearest neighbors: ...
ALERT: New transaction is likely fraudulent! Average distance (...) exceeds threshold (200.00).

The distances are elided here because they depend on the random data, but their scale is predictable. All values are squared L2 distances. Two random “normal” vectors from generate_transaction_vector() are 128 × 2/3 ≈ 85 apart on average, and the nearest neighbors of a normal query land somewhat below that. A “fraudulent” query sits roughly 128 × 112.7 ≈ 14,400 away from a typical normal vector. The threshold of 200.0 falls between those two scales, so the pattern of “fraudulent” vectors having significantly higher distances should hold on every run.
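The explanation above mentions keeping a mapping between the index's integer keys and the transaction UUIDs. One simple way to do that, sketched here with the standard library only, is a small registry that assigns the next free integer to each new UUID and can translate in both directions. The class name KeyRegistry and its methods are hypothetical and not part of USearch.

```python
import uuid

class KeyRegistry:
    """Bidirectional mapping between transaction UUIDs and integer index keys."""

    def __init__(self):
        self._uuid_to_key = {}
        self._key_to_uuid = []  # position in the list is the integer key

    def key_for(self, tx_id):
        """Return the integer key for tx_id, assigning the next free one if new."""
        if tx_id not in self._uuid_to_key:
            self._uuid_to_key[tx_id] = len(self._key_to_uuid)
            self._key_to_uuid.append(tx_id)
        return self._uuid_to_key[tx_id]

    def uuid_for(self, key):
        """Translate an integer key (Python or numpy integer) back to its UUID."""
        return self._key_to_uuid[int(key)]

registry = KeyRegistry()
tx_a, tx_b = uuid.uuid4(), uuid.uuid4()
key_a = registry.key_for(tx_a)
key_b = registry.key_for(tx_b)
print(key_a, key_b, registry.uuid_for(key_b) == tx_b)
```

With such a registry, the key passed to the index when adding a vector would come from key_for(), and each key returned by a search could be turned back into a transaction_id with uuid_for() before looking up the full record in ScyllaDB.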

Mini-Challenge: Refining Anomaly Detection

You’ve seen how a simple distance threshold can flag anomalies. Now, let’s make it a bit more sophisticated.

Challenge: Modify the find_anomalies function to also report the minimum distance to any neighbor. Then, add a secondary condition: a transaction is considered suspicious if its average distance is above the distance_threshold OR if its minimum distance to any neighbor is above a min_distance_to_any_neighbor_threshold (e.g., 75% of distance_threshold). Because the minimum can never exceed the average, the second threshold only adds anything when it is set lower than distance_threshold. This helps catch isolated outliers even if their average distance isn’t extremely high.

Hint: The results.distances array contains all the distances. You can easily find the minimum using np.min().

What to Observe/Learn: How combining multiple distance statistics (average and minimum) can create a more nuanced anomaly detection rule, potentially catching different types of fraudulent patterns.
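If you want to compare your attempt against something, here is one possible shape for the combined rule, written as a standalone function so it can be tried without a database. It uses Python built-ins rather than np.min(), and the threshold values in the calls are illustrative assumptions, not recommendations.

```python
def is_suspicious(distances, avg_threshold, min_threshold):
    """Flag a transaction if its neighbors are far away on average,
    or if even the closest neighbor is farther than min_threshold."""
    distances = [float(d) for d in distances]
    if not distances:
        return True  # no neighbors at all: nothing similar has been seen
    avg_distance = sum(distances) / len(distances)
    min_distance = min(distances)
    return avg_distance > avg_threshold or min_distance > min_threshold

# Caught by the average rule: one very distant neighbor pulls the mean above 200.
print(is_suspicious([120.0, 130.0, 400.0], avg_threshold=200.0, min_threshold=150.0))
# Caught only by the minimum rule: the mean is 170, but no neighbor is closer than 160.
print(is_suspicious([160.0, 170.0, 180.0], avg_threshold=200.0, min_threshold=150.0))
# Not flagged: the mean is below 200 and the closest neighbor is only 60 away.
print(is_suspicious([60.0, 70.0, 400.0], avg_threshold=200.0, min_threshold=150.0))
```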

Common Pitfalls & Troubleshooting

Vector Dimensionality Mismatch:
- Pitfall: Your VECTOR_DIMENSIONS in Python (usearch.Index and numpy arrays) must exactly match the dimension specified in your ScyllaDB schema (VECTOR<FLOAT, 128>). If they don’t, you’ll encounter errors during ingestion or indexing.
- Troubleshooting: Double-check all ndim and VECTOR<FLOAT, X> declarations. Ensure your numpy arrays are created with the correct shape.
Choosing the Right Distance Metric:
- Pitfall: Using an inappropriate distance metric (l2sq, ip, cos) can lead to poor search results. l2sq (squared Euclidean distance) is common and is sensitive to vector magnitude. cos (cosine distance) is good when only vector direction matters, not magnitude. ip (inner product) is often used with normalized vectors. Whichever metric you choose, USearch reports results as distances, so smaller values always mean more similar.
- Troubleshooting: Understand how your embeddings were generated. If they are normalized, cosine or inner product might be better. If raw feature values are important, Euclidean distance is often a safe bet. Experiment with different metrics and evaluate their performance.
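The difference between the metrics is easiest to see on tiny vectors. The sketch below defines the three distances in plain numpy, following the common conventions of cosine distance as 1 minus cosine similarity and inner-product distance as 1 minus the dot product. These are my own definitions for illustration, not calls into USearch, so treat the exact formulas as an assumption about how the library defines them.

```python
import numpy as np

def l2sq(a, b):
    """Squared Euclidean distance."""
    diff = a - b
    return float(np.dot(diff, diff))

def cosine_distance(a, b):
    """1 - cosine similarity: ignores magnitude, depends only on direction."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def inner_product_distance(a, b):
    """1 - dot product: meaningful mainly when vectors are normalized."""
    return float(1.0 - np.dot(a, b))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction as a, much larger magnitude
c = np.array([0.0, 1.0])   # same magnitude as a, orthogonal direction

for name, fn in [("l2sq", l2sq), ("cos", cosine_distance), ("ip", inner_product_distance)]:
    print(f"{name}: d(a, b) = {fn(a, b):.1f}, d(a, c) = {fn(a, c):.1f}")
```

Under l2sq, b is far from a and c is close; under cos the ranking flips, because b points the same way as a. For unnormalized vectors the inner-product distance can even go negative, which is one reason it is usually paired with normalized embeddings.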
Performance Bottlenecks with Large Datasets:
- Pitfall: For truly massive real-time fraud detection systems, loading all historical vectors into a single in-memory USearch index might not be feasible.
- Troubleshooting:
  - Sharding: Distribute your USearch indices across multiple application instances, each handling a subset of data (e.g., by user ID range, or recent time window).
  - ScyllaDB Vector Search: Leverage ScyllaDB’s native vector search for queries that don’t require the absolute lowest latency of an in-memory USearch index, or for retrieving broader historical context.
  - Hybrid Approach: Use USearch for hot, real-time data and ScyllaDB’s native vector search for cold or less critical queries.
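As a sketch of the sharding idea, the function below routes each user to one of a fixed number of index shards based on the user's UUID. The function name shard_for_user and the shard count are assumptions for this example; a production system would also need a plan for resharding, which this does not address.

```python
import uuid

def shard_for_user(user_id, num_shards):
    """Map a user UUID to a shard number in the range [0, num_shards)."""
    # uuid.UUID.int is the UUID's 128-bit integer value, so the result is
    # stable across processes and restarts.
    return user_id.int % num_shards

user = uuid.UUID("12345678-1234-5678-1234-567812345678")
shard = shard_for_user(user, num_shards=4)
print(shard)
```

Each application instance would then hold the USearch index for its own shard only, and an incoming transaction would be checked against the shard that owns its user_id.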
Data Skew and Threshold Tuning:
- Pitfall: Fraudulent transactions are typically very rare. A simple distance threshold might be hard to tune correctly, leading to many false positives (legitimate transactions flagged) or false negatives (fraud missed).
- Troubleshooting:
  - Dynamic Thresholds: Instead of a fixed number, use statistical methods (e.g., standard deviation from the mean distance of normal transactions) to set thresholds.
  - Labeled Data: If you have labeled fraud data, use it to train a classifier on the distances and other features derived from vector search, rather than just a hard threshold.
  - User-Specific Baselines: Calculate an average distance profile for each user and flag transactions that deviate significantly from their own historical patterns.
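The dynamic-threshold idea can be sketched in a few lines: collect the neighbor-distance scores of transactions known to be legitimate, then place the cutoff a chosen number of standard deviations above their mean. The function name dynamic_threshold, the default of three standard deviations, and the baseline values are all assumptions for illustration.

```python
import numpy as np

def dynamic_threshold(baseline_scores, num_std=3.0):
    """Mean of the baseline scores plus num_std population standard deviations."""
    scores = np.asarray(baseline_scores, dtype=np.float64)
    return float(scores.mean() + num_std * scores.std())

# Hypothetical average-neighbor-distance scores observed for legitimate transactions.
baseline = [55.0, 58.0, 60.0, 62.0, 65.0]
threshold = dynamic_threshold(baseline)
print(f"threshold: {threshold:.2f}")
```

The same function applied to a single user's scores gives a user-specific baseline, so a transaction is judged against that user's own history rather than the global population.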

Summary

Congratulations! You’ve successfully built a foundational fraud detection system using USearch and ScyllaDB. In this chapter, you learned:

How to conceptualize diverse transaction data into high-dimensional vectors (embeddings) for anomaly detection.
To leverage ScyllaDB’s native VECTOR type and a vector index for scalable and efficient storage of transaction vectors and metadata.
To use USearch for ultra-fast, in-memory approximate nearest neighbor search to identify transactions that are outliers from known legitimate patterns.
To implement a basic anomaly detection logic based on average distance to nearest neighbors.
Key considerations for performance, metric choice, and troubleshooting in a real-world fraud detection scenario.

This project demonstrates the power of combining a lightning-fast vector search library like USearch with a massively scalable database like ScyllaDB for real-time, AI-driven applications. From here, you can explore more advanced embedding techniques, integrate with real-time streaming platforms, and build more sophisticated anomaly detection models.

References

This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.
