This page shows you how to use Cloud Firestore to perform K-nearest neighbor (KNN) vector searches using the following techniques:
- Store vector values
- Create and manage KNN vector indexes
- Make a K-nearest-neighbor (KNN) query using one of the supported vector distance measures
Store vector embeddings
You can create vector values such as text embeddings from your Cloud Firestore data, and store them in Cloud Firestore documents.
Write operation with a vector embedding
The following example shows how to store a vector embedding in a Cloud Firestore document:
Node.js
    import {
      Firestore,
      FieldValue,
    } from "@google-cloud/firestore";

    const db = new Firestore();
    const coll = db.collection('coffee-beans');
    await coll.add({
      name: "Kahawa coffee beans",
      description: "Information about the Kahawa coffee beans.",
      embedding_field: FieldValue.vector([1.0 , 2.0, 3.0])
    });
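As a client-side precaution before a write like the one above, you might check that an embedding fits the limits this page describes. The following is a minimal sketch, not part of any Firestore client library: `validate_embedding` and `MAX_DIMENSION` are hypothetical names, the 2048 cap comes from the Limitations section, and the non-empty and finite-value checks are assumptions added for illustration.

```python
import math
from typing import List, Sequence

# Maximum embedding dimension stated in the Limitations section of this page.
MAX_DIMENSION = 2048


def validate_embedding(values: Sequence[float]) -> List[float]:
    """Return the embedding as a list of floats, or raise ValueError.

    Hypothetical helper for illustration; not a Firestore API.
    """
    if not 1 <= len(values) <= MAX_DIMENSION:
        raise ValueError(
            f"dimension must be between 1 and {MAX_DIMENSION}, got {len(values)}"
        )
    result = [float(v) for v in values]
    # Assumption: NaN and infinity are rejected because they make distances meaningless.
    if not all(math.isfinite(v) for v in result):
        raise ValueError("embedding contains NaN or infinity")
    return result
```

The validated list could then be wrapped in the client library's vector type before the write.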
Compute vector embeddings with a Cloud Function
To calculate and store vector embeddings whenever a document is updated or created, you can set up a Cloud Function:
Python
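    import functions_framework
    from google.cloud import firestore as firestore_sdk
    from google.events.cloud import firestore

    firestore_client = firestore_sdk.Client()

    @functions_framework.cloud_event
    def store_embedding(cloud_event) -> None:
        """Triggered by a change to a Firestore document."""
        # ParseFromString fills the message in place; its return value is not the payload.
        firestore_payload = firestore.DocumentEventData()
        firestore_payload._pb.ParseFromString(cloud_event.data)

        # from_payload and calculate_embedding are helpers that you supply.
        collection_id, doc_id = from_payload(firestore_payload)
        # Call a function to calculate the embedding
        embedding = calculate_embedding(firestore_payload)
        # Update the document
        doc = firestore_client.collection(collection_id).document(doc_id)
        doc.set({"embedding_field": embedding}, merge=True)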
Node.js
    /**
     * A vector embedding will be computed from the
     * value of the `content` field. The vector value
     * will be stored in the `embedding` field. The
     * field names `content` and `embedding` are arbitrary
     * field names chosen for this example.
     */
    async function storeEmbedding(event: FirestoreEvent<any>): Promise<void> {
      // Get the previous value of the document's `content` field.
      const previousDocumentSnapshot = event.data.before as QueryDocumentSnapshot;
      const previousContent = previousDocumentSnapshot.get("content");

      // Get the current value of the document's `content` field.
      const currentDocumentSnapshot = event.data.after as QueryDocumentSnapshot;
      const currentContent = currentDocumentSnapshot.get("content");

      // Don't update the embedding if the content field did not change
      if (previousContent === currentContent) {
        return;
      }

      // Call a function to calculate the embedding for the value
      // of the `content` field.
      const embeddingVector = calculateEmbedding(currentContent);

      // Update the `embedding` field on the document.
      await currentDocumentSnapshot.ref.update({
        embedding: embeddingVector,
      });
    }
Create and manage vector indexes
Before you can perform a nearest neighbor search with your vector embeddings, you must create a corresponding index. The following examples demonstrate how to create and manage vector indexes.
Create a vector index
Before you create a vector index, upgrade to the latest version of the Google Cloud CLI:
gcloud components update
To create a vector index, use gcloud firestore indexes composite create:
gcloud
    gcloud firestore indexes composite create \
      --collection-group=collection-group \
      --query-scope=COLLECTION \
      --field-config field-path=vector-field,vector-config='vector-configuration' \
      --database=database-id
where:
- collection-group is the ID of the collection group.
- vector-field is the name of the field that contains the vector embedding.
- database-id is the ID of the database.
- vector-configuration includes the vector dimension and index type. The dimension is an integer up to 2048. The index type must be flat. Format the index configuration as follows: {"dimension":"DIMENSION", "flat": "{}"}.
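The configuration format above can also be generated programmatically. This sketch assumes you want to assemble the `gcloud` arguments in Python; `vector_index_command` is a hypothetical helper, and it uses only the flags shown in the command above.

```python
import json
from typing import List


def vector_index_command(collection_group: str, vector_field: str,
                         dimension: int, database_id: str) -> List[str]:
    """Build the argument list for creating a flat vector index.

    Hypothetical helper for illustration; it only assembles strings.
    """
    if not 1 <= dimension <= 2048:
        raise ValueError("dimension must be an integer between 1 and 2048")
    # The page formats the dimension as a string and the flat config as "{}".
    vector_config = json.dumps({"dimension": str(dimension), "flat": "{}"})
    return [
        "gcloud", "firestore", "indexes", "composite", "create",
        f"--collection-group={collection_group}",
        "--query-scope=COLLECTION",
        "--field-config",
        f"field-path={vector_field},vector-config={vector_config}",
        f"--database={database_id}",
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Building an argument list rather than a shell string avoids quoting the JSON by hand.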
The following example creates a composite index, including a vector index for field vector-field and an ascending index for field color. You can use this type of index to pre-filter data before a nearest neighbor search.
gcloud
    gcloud firestore indexes composite create \
      --collection-group=collection-group \
      --query-scope=COLLECTION \
      --field-config=order=ASCENDING,field-path="color" \
      --field-config field-path=vector-field,vector-config='{"dimension":"1024", "flat": "{}"}' \
      --database=database-id
List all vector indexes
gcloud
gcloud firestore indexes composite list --database=database-id
Replace database-id with the ID of the database.
Delete a vector index
gcloud
gcloud firestore indexes composite delete index-id --database=database-id
where:
- index-id is the ID of the index to delete. Use indexes composite list to retrieve the index ID.
- database-id is the ID of the database.
Describe a vector index
gcloud
gcloud firestore indexes composite describe index-id --database=database-id
where:
- index-id is the ID of the index to describe. Use indexes composite list to retrieve the index ID.
- database-id is the ID of the database.
Make a nearest-neighbor query
You can perform a similarity search to find the nearest neighbors of a vector embedding. Similarity searches require vector indexes. If an index doesn't exist, Cloud Firestore suggests an index to create using the gcloud CLI.
The following example finds 10 nearest neighbors of the query vector.
Node.js
    import {
      Firestore,
      FieldValue,
      VectorQuery,
      VectorQuerySnapshot,
    } from "@google-cloud/firestore";

    const db = new Firestore();
    const coll = db.collection('coffee-beans');

    // Requires a single-field vector index
    const vectorQuery: VectorQuery = coll.findNearest({
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN'
    });

    const vectorQuerySnapshot: VectorQuerySnapshot = await vectorQuery.get();
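Conceptually, the query above returns the `limit` documents whose stored vectors are closest to the query vector. The sketch below is a brute-force, in-memory illustration of that idea only. It says nothing about how Cloud Firestore executes the search, and `find_nearest` and its inputs are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def find_nearest(docs: Dict[str, Sequence[float]],
                 query_vector: Sequence[float],
                 limit: int) -> List[Tuple[str, float]]:
    """Return up to `limit` (doc_id, distance) pairs, closest first.

    Uses Euclidean distance and scans every document (illustration only).
    """
    scored = [
        (doc_id, math.dist(vector, query_vector))
        for doc_id, vector in docs.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]
```

Calling `find_nearest(docs, [3.0, 1.0, 2.0], limit=10)` on a small dictionary of vectors mirrors the shape of the query above.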
Vector distances
Nearest-neighbor queries support the following options for vector distance:
- EUCLIDEAN: Measures the Euclidean distance between the vectors. To learn more, see Euclidean.
- COSINE: Compares vectors based on the angle between them, which lets you measure similarity that isn't based on the vectors' magnitude. We recommend using DOT_PRODUCT with unit-normalized vectors instead of COSINE distance; the two are mathematically equivalent, and DOT_PRODUCT performs better. To learn more, see Cosine similarity.
- DOT_PRODUCT: Similar to COSINE but is affected by the magnitude of the vectors. To learn more, see Dot product.
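For reference, the conventional definitions of the three measures for vectors a and b of dimension n are sketched below. The cosine form (one minus cosine similarity) is an assumption: it is consistent with the 0 to 2 range this page gives for COSINE, but the page does not state the formula.

```latex
\begin{aligned}
d_{\text{EUCLIDEAN}}(a, b) &= \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \\
d_{\text{COSINE}}(a, b) &= 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \\
d_{\text{DOT\_PRODUCT}}(a, b) &= a \cdot b = \sum_{i=1}^{n} a_i b_i
\end{aligned}
```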
Choose the distance measure
Which distance measure to use depends on whether all of your vector embeddings are normalized. A normalized vector embedding has a magnitude (length) of exactly 1.0.
In addition, if you know which distance measure your model was trained with, use that distance measure to compute the distance between your vector embeddings.
Normalized data
If you have a dataset where all vector embeddings are normalized, then all three distance measures provide the same semantic search results. In essence, although each distance measure returns a different value, those values sort the same way. When embeddings are normalized, DOT_PRODUCT is usually the most computationally efficient, but the difference is negligible in most cases. However, if your application is highly performance sensitive, DOT_PRODUCT might help with performance tuning.
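The claim that normalized embeddings sort the same way under all three measures can be checked on a small example. This is a sketch under assumptions: `normalize` and `rankings` are hypothetical helpers, and cosine distance is taken to be one minus cosine similarity.

```python
import math
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def rankings(docs: Dict[str, Sequence[float]],
             query: Sequence[float]) -> Dict[str, List[str]]:
    """Rank document IDs from most to least similar under each measure."""
    def cosine(v: Sequence[float]) -> float:
        return 1.0 - dot(v, query) / math.sqrt(dot(v, v) * dot(query, query))

    return {
        # Smaller distance means more similar.
        "EUCLIDEAN": sorted(docs, key=lambda d: math.dist(docs[d], query)),
        "COSINE": sorted(docs, key=lambda d: cosine(docs[d])),
        # Larger dot product means more similar, so sort descending.
        "DOT_PRODUCT": sorted(docs, key=lambda d: -dot(docs[d], query)),
    }
```

For unit vectors, the squared Euclidean distance equals 2 minus twice the dot product, which is why the three orderings coincide.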
Non-normalized data
If you have a dataset where vector embeddings aren't normalized, then it's not mathematically correct to use DOT_PRODUCT as a distance measure because dot product doesn't measure distance. Depending on how the embeddings were generated and what type of search is preferred, either the COSINE or EUCLIDEAN distance measure produces search results that are subjectively better than the other distance measures. Experimentation with either COSINE or EUCLIDEAN might be necessary to determine which is best for your use case.
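A small example can show why the dot product misleads on non-normalized data: a long vector pointing in a different direction can outscore a short vector that matches the query exactly. The data and names below are made up for illustration, and cosine distance is again assumed to be one minus cosine similarity.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# "short" is identical to the query; "long" points elsewhere but is 10x longer.
docs = {"short": [1.0, 1.0], "long": [10.0, 0.0]}
query = [1.0, 1.0]

# The dot product grows with magnitude, so it prefers "long".
top_by_dot = max(docs, key=lambda d: dot(docs[d], query))
# The distance-based measures both prefer the exact match.
top_by_cosine = min(docs, key=lambda d: cosine_distance(docs[d], query))
top_by_euclidean = min(docs, key=lambda d: math.dist(docs[d], query))
```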
Unsure if data is normalized or non-normalized
If you're unsure whether or not your data is normalized and you want to use DOT_PRODUCT, we recommend that you use COSINE instead. COSINE is like DOT_PRODUCT with normalization built in. Distance measured using COSINE ranges from 0 to 2. A result that is close to 0 indicates the vectors are very similar.
Pre-filter documents
To pre-filter documents before finding the nearest neighbors, you can combine a similarity search with other query operators. The and and or composite filters are supported. For more information about supported field filters, see Query operators.
Node.js
    // Similarity search with pre-filter
    // Requires composite vector index
    const preFilteredVectorQuery: VectorQuery = coll
      .where("color", "==", "red")
      .findNearest({
        vectorField: "embedding_field",
        queryVector: [3.0, 1.0, 2.0],
        limit: 5,
        distanceMeasure: "EUCLIDEAN",
      });

    const vectorQueryResults = await preFilteredVectorQuery.get();
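The order of operations in a pre-filtered search (filter first, then rank the survivors by distance) can be sketched in memory as follows. This is an illustration of the semantics only, with a hypothetical `prefiltered_nearest` helper and document shape; it is not how Cloud Firestore evaluates the query.

```python
import math
from typing import Any, Dict, List, Sequence


def prefiltered_nearest(docs: List[Dict[str, Any]], color: str,
                        query_vector: Sequence[float], limit: int) -> List[str]:
    """Keep documents with a matching color, then return the closest names."""
    # Step 1: apply the equality filter.
    candidates = [d for d in docs if d["color"] == color]
    # Step 2: rank only the remaining documents by Euclidean distance.
    candidates.sort(key=lambda d: math.dist(d["embedding_field"], query_vector))
    return [d["name"] for d in candidates[:limit]]
```

A document that is closest overall but fails the filter never appears in the result.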
Retrieve the calculated vector distance
You can retrieve the calculated vector distance by assigning a distance_result_field output property name on the FindNearest query, as shown in the following example:
Node.js
    const vectorQuery: VectorQuery = coll.findNearest({
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceResultField: 'vector_distance'
    });

    const snapshot: VectorQuerySnapshot = await vectorQuery.get();

    snapshot.forEach((doc) => {
      console.log(doc.id, ' Distance: ', doc.get('vector_distance'));
    });
If you want to use a field mask to return a subset of document fields along with a distanceResultField, then you must also include the value of distanceResultField in the field mask, as shown in the following example:
Node.js
    const vectorQuery: VectorQuery = coll
      .select('color', 'vector_distance')
      .findNearest({
        vectorField: 'embedding_field',
        queryVector: [3.0, 1.0, 2.0],
        limit: 10,
        distanceMeasure: 'EUCLIDEAN',
        distanceResultField: 'vector_distance'
      });
Specify a distance threshold
You can specify a similarity threshold that returns only documents within the threshold. The behavior of the threshold field depends on the distance measure you choose:
- EUCLIDEAN and COSINE distances limit the threshold to documents where distance is less than or equal to the specified threshold. These distance measures decrease as the vectors become more similar.
- DOT_PRODUCT distance limits the threshold to documents where distance is greater than or equal to the specified threshold. Dot product distances increase as the vectors become more similar.
The following example shows how to specify a distance threshold to return up to 10 nearest documents that are, at most, 4.5 units away using the EUCLIDEAN distance metric:
Node.js
    const vectorQuery: VectorQuery = coll.findNearest({
      vectorField: 'embedding_field',
      queryVector: [3.0, 1.0, 2.0],
      limit: 10,
      distanceMeasure: 'EUCLIDEAN',
      distanceThreshold: 4.5
    });

    const snapshot: VectorQuerySnapshot = await vectorQuery.get();

    snapshot.forEach((doc) => {
      console.log(doc.id);
    });
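The direction of the threshold comparison depends on the distance measure, as described above. The rule can be restated as a small predicate; `within_threshold` is a hypothetical name used only to illustrate the comparison, not a client library function.

```python
def within_threshold(distance: float, threshold: float, measure: str) -> bool:
    """Return True if a result with this distance passes the threshold."""
    if measure in ("EUCLIDEAN", "COSINE"):
        # Smaller values mean more similar, so keep results at or below the threshold.
        return distance <= threshold
    if measure == "DOT_PRODUCT":
        # Larger values mean more similar, so keep results at or above the threshold.
        return distance >= threshold
    raise ValueError(f"unsupported distance measure: {measure}")
```

For the example above, a document 4.0 units away passes a EUCLIDEAN threshold of 4.5, and one 5.0 units away does not.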
Limitations
As you work with vector embeddings, note the following limitations:
- The maximum supported embedding dimension is 2048. To store embeddings with more dimensions, use dimensionality reduction first.
- The maximum number of documents to return from a nearest-neighbor query is 1000.
- Vector search does not support real-time snapshot listeners.
- Only the Python and Node.js client libraries support vector search.
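For the dimension limit in the first item above, one possible reduction technique is a Gaussian random projection. The page does not prescribe a method, so treat this as a sketch under assumptions: the function names are hypothetical, other techniques such as PCA might suit your embeddings better, and the same matrix (same seed) must be applied to both stored vectors and query vectors.

```python
import math
import random
from typing import List, Sequence


def random_projection_matrix(in_dim: int, out_dim: int,
                             seed: int = 0) -> List[List[float]]:
    """Build an out_dim x in_dim matrix of scaled Gaussian entries."""
    rng = random.Random(seed)  # fixed seed so the matrix is reproducible
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(vector: Sequence[float],
            matrix: Sequence[Sequence[float]]) -> List[float]:
    """Multiply the matrix by the vector to get a lower-dimensional vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("vector length must match the matrix input dimension")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
```

Projection approximately preserves distances between vectors, so results after reduction can differ slightly from a search on the original embeddings.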
What's next
- Read about best practices for Cloud Firestore.
- Understand reads and writes at scale.