Vector Databases in Kastrax
Vector databases are specialized storage systems designed to efficiently store, index, and query high-dimensional vector embeddings. They are a critical component of any RAG system, enabling fast similarity search across large collections of embedded documents.
Vector Database Architecture
Kastrax provides a unified interface for working with vector databases, abstracting away the differences between various providers while still allowing access to provider-specific features. The architecture consists of:
- VectorDB Interface: A common API for all vector database operations
- Database Adapters: Provider-specific implementations of the interface
- Query Builders: Tools for constructing complex vector queries
- Metadata Management: Systems for storing and retrieving document metadata
- Filtering Capabilities: Methods for filtering results based on metadata
This architecture enables you to switch between vector database providers with minimal code changes, while still leveraging the unique capabilities of each provider.
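To make the abstraction concrete, here is a minimal sketch of what such a unified interface could look like in Kotlin. It is illustrative only: the method names mirror the operations shown later on this page, but the exact Kastrax signatures (and types such as SearchResult) may differ.
import ai.kastrax.rag.document.EmbeddedChunk
import ai.kastrax.rag.vectordb.IndexOptions
// Illustrative sketch of a unified vector database interface (not the actual Kastrax API)
interface VectorDB {
    fun createIndex(options: IndexOptions)
    fun upsert(indexName: String, chunks: List<EmbeddedChunk>): List<String>
    fun search(
        indexName: String,
        embedding: List<Float>,           // Query embedding from your embedder
        limit: Int = 10,                  // Maximum number of results
        minScore: Double? = null,         // Optional similarity threshold
        filter: Map<String, Any>? = null  // Optional metadata filter
    ): List<SearchResult>                 // SearchResult pairs a chunk with its score
    fun get(indexName: String, ids: List<String>): List<EmbeddedChunk>
    fun delete(indexName: String, ids: List<String>)
    fun deleteByFilter(indexName: String, filter: Map<String, Any>)
}
Each database adapter (PgVectorAdapter, PineconeAdapter, and so on) implements this contract, which is what makes switching providers a one-line change.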
Supported Vector Databases
Kastrax supports a wide range of vector database providers, from self-hosted options to managed cloud services. Each provider has its own strengths and use cases.
pgvector
import ai.kastrax.rag.vectordb.PgVectorAdapter
import ai.kastrax.rag.vectordb.IndexOptions
import ai.kastrax.rag.document.EmbeddedChunk
// Create a PostgreSQL vector database adapter
val vectorDB = PgVectorAdapter(
connectionString = System.getenv("POSTGRES_CONNECTION_STRING"),
schema = "public" // Optional: specify schema
)
// Create an index with the appropriate dimension
val indexOptions = IndexOptions(
indexName = "document_embeddings",
dimensions = 1536,
metric = "cosine" // Similarity metric: cosine, dot_product, or euclidean
)
vectorDB.createIndex(indexOptions)
// Store embeddings in the database
vectorDB.upsert(
indexName = "document_embeddings",
chunks = embeddedChunks // List of EmbeddedChunk objects
)
Using PostgreSQL with pgvector
PostgreSQL with the pgvector extension is a good solution for teams already using PostgreSQL who want to minimize infrastructure complexity. It provides solid performance for small to medium-sized collections and integrates well with existing PostgreSQL workflows.
Key features:
- Supports multiple distance metrics (cosine, dot product, Euclidean)
- Efficient indexing with HNSW and IVFFlat algorithms
- Familiar SQL interface for complex queries
- Transactional guarantees
For detailed setup instructions and best practices, see the official pgvector repository.
Vector Database Operations
Once you’ve set up your vector database, you can perform various operations on your embedded documents.
Creating Indexes
Before storing embeddings, you need to create an index with the appropriate dimension size for your embedding model:
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.vectordb.IndexOptions
// Create an index with the appropriate dimension
val indexOptions = IndexOptions(
indexName = "document_embeddings",
dimensions = 1536, // Must match your embedding model's output dimension
metric = "cosine", // Similarity metric: cosine, dot_product, or euclidean
// Optional provider-specific settings
shards = 1, // Number of shards (for distributed databases)
replicas = 1 // Number of replicas (for high availability)
)
// Create the index
vectorDB.createIndex(indexOptions)
The dimension size must match the output dimension of your chosen embedding model. Common dimension sizes are:
- OpenAI text-embedding-3-small: 1536 dimensions (or custom, e.g., 256)
- DeepSeek embed-base: 1024 dimensions
- Cohere embed-multilingual-v3: 1024 dimensions
- Google text-embedding-004: 768 dimensions (or custom)
Important: Index dimensions cannot be changed after creation. To use a different model, you must delete and recreate the index with the new dimension size.
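If you do need to switch models, the flow below shows the delete-and-recreate step. The deleteIndex call is a hypothetical method name used for illustration; check your adapter for the exact operation.
// Hypothetical flow for switching to a model with a different output dimension.
// deleteIndex is an assumed method name; consult your adapter's documentation.
vectorDB.deleteIndex("document_embeddings")
vectorDB.createIndex(
    IndexOptions(
        indexName = "document_embeddings",
        dimensions = 768, // e.g., moving to Google text-embedding-004
        metric = "cosine"
    )
)
// Re-embed all documents with the new model, then upsert them again.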
Similarity Search
The most common operation is similarity search, which finds documents similar to a query embedding:
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.embedding.OpenAIEmbedder
// Create an embedder for query embedding
val embedder = OpenAIEmbedder()
// Generate embedding for the query
val query = "How does climate change affect agriculture?"
val queryEmbedding = embedder.embedText(query)
// Search for similar documents
val searchResults = vectorDB.search(
indexName = "document_embeddings",
embedding = queryEmbedding,
limit = 5, // Return top 5 results
minScore = 0.7 // Only return results with similarity score >= 0.7
)
// Process the results
searchResults.forEach { result ->
println("Score: ${result.score}")
println("Content: ${result.chunk.content.take(100)}...")
println("Metadata: ${result.chunk.metadata}")
println()
}
Metadata Filtering
Most vector databases support filtering results based on metadata, allowing you to combine semantic search with traditional filtering:
// Search with metadata filtering
val searchResults = vectorDB.search(
indexName = "document_embeddings",
embedding = queryEmbedding,
filter = mapOf(
"category" to "science", // Only return documents with category="science"
"year" to mapOf(
"$gte" to 2020 // Only return documents with year >= 2020
),
"authors" to mapOf(
"$in" to listOf("Smith", "Johnson") // Author must be in this list
)
),
limit = 5
)
Note: Filter syntax varies by database provider. Check the specific adapter documentation for details on the supported filter operations.
Hybrid Search
Some vector databases support hybrid search, combining vector similarity with keyword or full-text search:
// Perform hybrid search (vector + keyword)
val hybridResults = vectorDB.hybridSearch(
indexName = "document_embeddings",
embedding = queryEmbedding,
text = "climate agriculture drought", // Keyword search
vectorWeight = 0.7, // Weight for vector similarity (0.0-1.0)
textWeight = 0.3, // Weight for text similarity (0.0-1.0)
limit = 5
)
CRUD Operations
Vector databases support standard CRUD (Create, Read, Update, Delete) operations:
// Create/Update: Upsert embeddings (already seen above)
val ids = vectorDB.upsert(
indexName = "document_embeddings",
chunks = embeddedChunks
)
// Read: Get specific documents by ID
val documents = vectorDB.get(
indexName = "document_embeddings",
ids = listOf("doc1", "doc2", "doc3")
)
// Delete: Remove documents by ID
vectorDB.delete(
indexName = "document_embeddings",
ids = listOf("doc1", "doc2")
)
// Delete by filter: Remove documents matching criteria
vectorDB.deleteByFilter(
indexName = "document_embeddings",
filter = mapOf("category" to "outdated")
)
Naming Rules for Databases
Each vector database enforces specific naming conventions for indexes and collections to ensure compatibility and prevent conflicts.
pgvector
Index names must:
- Start with a letter or underscore
- Contain only letters, numbers, and underscores
- Example: my_index_123 is valid
- Example: my-index is not valid (contains a hyphen)
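These rules reduce to a simple pattern check. The helper below is not part of Kastrax; it is a small illustrative guard you could run before calling createIndex:
// Illustrative pre-flight check for pgvector-style index names (not a Kastrax API)
val pgIndexNamePattern = Regex("^[A-Za-z_][A-Za-z0-9_]*$")
fun requireValidPgIndexName(name: String) {
    require(pgIndexNamePattern.matches(name)) {
        "Invalid pgvector index name: '$name'. Use only letters, numbers, and " +
        "underscores, starting with a letter or underscore."
    }
}
requireValidPgIndexName("my_index_123") // OK
// requireValidPgIndexName("my-index")  // Throws: contains a hyphen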
Upserting Embeddings
After creating an index, you can store embeddings along with their basic metadata:
// Store embeddings with their corresponding metadata
vectorDB.upsert(
indexName = "document_embeddings", // Index name
chunks = embeddedChunks // Each EmbeddedChunk carries its embedding vector,
                        // original text content, and optional unique identifier
)
The upsert operation:
- Takes a list of embedded chunks (embedding vectors together with their text and metadata)
- Updates existing vectors if they share the same ID
- Creates new vectors if they don’t exist
- Automatically handles batching for large datasets
For complete examples of upserting embeddings in different vector stores, see the Upsert Embeddings guide.
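To see the update-versus-create behavior concretely, the sketch below upserts the same logical chunks twice. It assumes chunk IDs are stable across runs; revisedDocument is a placeholder for a re-processed Document.
// First upsert inserts all chunks and returns their IDs
val ids = vectorDB.upsert(
    indexName = "document_embeddings",
    chunks = embeddedChunks
)
// Re-embedding the same chunks (same IDs) and upserting again updates the stored
// vectors in place instead of creating duplicates.
// revisedDocument is a placeholder for a re-processed Document with stable chunk IDs.
val updatedChunks = embedder.embed(chunker.chunk(revisedDocument))
vectorDB.upsert(
    indexName = "document_embeddings",
    chunks = updatedChunks
)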
Adding Metadata
Vector stores support rich metadata (any JSON-serializable fields) for filtering and organization. Since metadata is stored with no fixed schema, use consistent field naming to avoid unexpected query results.
Important: Metadata is crucial for vector storage - without it, you’d only have numerical embeddings with no way to return the original text or filter results. Always store at least the source text as metadata.
// Store embeddings with rich metadata for better organization and filtering
val enrichedChunks = embeddedChunks.map { chunk ->
// chunk.copy assumes EmbeddedChunk is a data class carrying a metadata map
chunk.copy(
metadata = chunk.metadata + mapOf(
// Basic content
"text" to chunk.content,
// Document organization (example values)
"source" to "user-guide.pdf",
"category" to "documentation",
// Temporal metadata
"createdAt" to java.time.Instant.now().toString(),
"version" to "1.0",
// Custom fields (example values)
"language" to "en",
"author" to "docs-team"
)
)
}
vectorDB.upsert(
indexName = "document_embeddings",
chunks = enrichedChunks
)
Key metadata considerations:
- Be strict with field naming - inconsistencies like ‘category’ vs ‘Category’ will affect queries
- Only include fields you plan to filter or sort by - extra fields add overhead
- Add timestamps (e.g., ‘createdAt’, ‘lastUpdated’) to track content freshness
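One lightweight way to keep field naming consistent is to centralize the keys in one place. The object below is an illustrative convention, not a Kastrax type; chunk refers to an embedded chunk as elsewhere on this page.
// Centralizing metadata keys prevents drift like "category" vs "Category"
object MetaKeys {
    const val TEXT = "text"
    const val SOURCE = "source"
    const val CATEGORY = "category"
    const val CREATED_AT = "createdAt"
}
val metadata = mapOf(
    MetaKeys.TEXT to chunk.content,
    MetaKeys.CATEGORY to "science",
    MetaKeys.CREATED_AT to java.time.Instant.now().toString()
)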
Best Practices
Performance Optimization
- Choose the right vector database for your specific needs:
  - For small to medium datasets with existing PostgreSQL: use pgvector
  - For large-scale production: consider Pinecone, Qdrant, or MongoDB Atlas
  - For edge deployment: consider LibSQL or Chroma
  - For serverless: consider Upstash or Cloudflare Vectorize
- Optimize index configuration:
  - Use appropriate indexing algorithms (HNSW, IVFFlat, etc.) based on your dataset size
  - Balance recall vs. performance with index parameters
  - Consider sharding for very large datasets
- Batch operations (see the sketch after this list):
  - Use batch operations for large insertions (the upsert method handles batching automatically)
  - Process embeddings in batches to avoid memory issues
  - Consider async operations for non-blocking performance
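Although upsert batches internally, bounding how much you embed at once keeps memory flat for very large corpora. The coroutine-based sketch below is illustrative; the function name and batch size are assumptions, and Chunk stands in for the chunker's output type.
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
// Illustrative bounded-batch indexing loop: embed and store in fixed-size slices
// so the full corpus never has to be held in memory at once
suspend fun indexInBatches(chunks: List<Chunk>, batchSize: Int = 256) {
    chunks.chunked(batchSize).forEach { batch ->
        // Run embedding off the main dispatcher; embed() mirrors the
        // embedder API used elsewhere on this page
        val embedded = withContext(Dispatchers.IO) { embedder.embed(batch) }
        vectorDB.upsert(indexName = "document_embeddings", chunks = embedded)
    }
}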
Data Management
- Metadata strategy:
  - Only store metadata you’ll query against
  - Be consistent with field naming (e.g., ‘category’ vs ‘Category’)
  - Add timestamps (e.g., ‘createdAt’, ‘lastUpdated’) to track content freshness
- Dimension management:
  - Match embedding dimensions to your model (e.g., 1536 for text-embedding-3-small)
  - Consider dimension reduction for large collections
  - Create indexes before bulk insertions
- Data lifecycle (see the TTL sketch after this list):
  - Implement TTL (Time-To-Live) for temporary data
  - Set up regular reindexing for optimal performance
  - Create backup strategies for critical data
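The createdAt timestamps recommended above make a simple TTL sweep possible with deleteByFilter. The "$lt" operator below is an assumption; range-filter syntax varies by provider, as noted earlier.
import java.time.Instant
import java.time.temporal.ChronoUnit
// Illustrative TTL sweep: remove anything older than 30 days.
// ISO-8601 timestamps compare correctly as strings; the "\$lt" (less-than)
// operator is provider-dependent, so check your adapter's filter syntax.
val cutoff = Instant.now().minus(30, ChronoUnit.DAYS).toString()
vectorDB.deleteByFilter(
    indexName = "document_embeddings",
    filter = mapOf("createdAt" to mapOf("\$lt" to cutoff))
)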
Integration with RAG Pipeline
Vector databases are a critical component of the RAG pipeline in Kastrax. Here’s how they integrate with other components:
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.vectordb.PineconeAdapter
import ai.kastrax.rag.retrieval.RetrievalEngine
import ai.kastrax.rag.generation.LLMGenerator
// 1. Document Processing
val document = Document.fromText("Your document content...")
val chunker = RecursiveChunker()
val chunks = chunker.chunk(document)
// 2. Embedding Generation
val embedder = OpenAIEmbedder()
val embeddedChunks = embedder.embed(chunks)
// 3. Vector Database Storage
val vectorDB = PineconeAdapter()
vectorDB.upsert("document_embeddings", embeddedChunks)
// 4. Retrieval
val retrievalEngine = RetrievalEngine(vectorDB)
val query = "How does climate change affect agriculture?"
val retrievedChunks = retrievalEngine.retrieve(
query = query,
indexName = "document_embeddings",
limit = 5
)
// 5. Generation
val generator = LLMGenerator()
val answer = generator.generate(
query = query,
context = retrievedChunks.joinToString("\n\n") { it.chunk.content },
prompt = "Answer the question based on the provided context."
)
println("Query: $query")
println("Answer: $answer")
Conclusion
Vector databases are a fundamental component of any RAG system, enabling efficient storage and retrieval of vector embeddings. Kastrax provides a unified interface for working with various vector database providers, allowing you to choose the right solution for your specific needs while maintaining a consistent API.
Key takeaways:
- Choose the right database based on your scale, performance requirements, and existing infrastructure
- Configure indexes properly to match your embedding model dimensions and similarity metrics
- Use metadata effectively to enable filtering and hybrid search capabilities
- Implement best practices for performance optimization and data management
- Integrate seamlessly with the rest of your RAG pipeline
By following these guidelines, you can build a robust and efficient vector storage system that forms the backbone of your RAG applications.