
Vector Databases in Kastrax

Vector databases are specialized storage systems designed to efficiently store, index, and query high-dimensional vector embeddings. They are a critical component of any RAG system, enabling fast similarity search across large collections of embedded documents.

Vector Database Architecture

Kastrax provides a unified interface for working with vector databases, abstracting away the differences between various providers while still allowing access to provider-specific features. The architecture consists of:

  1. VectorDB Interface: A common API for all vector database operations
  2. Database Adapters: Provider-specific implementations of the interface
  3. Query Builders: Tools for constructing complex vector queries
  4. Metadata Management: Systems for storing and retrieving document metadata
  5. Filtering Capabilities: Methods for filtering results based on metadata

This architecture enables you to switch between vector database providers with minimal code changes, while still leveraging the unique capabilities of each provider.
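
To make the abstraction concrete, here is a minimal sketch of swapping providers behind the common interface. VectorDBAdapter, PgVectorAdapter, and PineconeAdapter all appear elsewhere in this guide; the embedding type (List<Float>) is an assumption for illustration.

AdapterSwap.kt
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.vectordb.PgVectorAdapter
import ai.kastrax.rag.vectordb.PineconeAdapter

// Both adapters implement the common VectorDBAdapter interface,
// so retrieval code is written once against the abstraction.
fun buildVectorDB(useManagedService: Boolean): VectorDBAdapter =
    if (useManagedService) {
        PineconeAdapter() // managed cloud service
    } else {
        PgVectorAdapter(  // self-hosted PostgreSQL + pgvector
            connectionString = System.getenv("POSTGRES_CONNECTION_STRING")
        )
    }

// Downstream code depends only on the interface, not the provider
// (the List<Float> embedding type is an assumption for illustration)
fun searchDocuments(db: VectorDBAdapter, queryEmbedding: List<Float>) =
    db.search(
        indexName = "document_embeddings",
        embedding = queryEmbedding,
        limit = 5
    )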

Supported Vector Databases

Kastrax supports a wide range of vector database providers, from self-hosted options to managed cloud services. Each provider has its own strengths and use cases.

PgVectorExample.kt
import ai.kastrax.rag.vectordb.PgVectorAdapter
import ai.kastrax.rag.vectordb.IndexOptions
import ai.kastrax.rag.document.EmbeddedChunk

// Create a PostgreSQL vector database adapter
val vectorDB = PgVectorAdapter(
    connectionString = System.getenv("POSTGRES_CONNECTION_STRING"),
    schema = "public" // Optional: specify schema
)

// Create an index with the appropriate dimension
val indexOptions = IndexOptions(
    indexName = "document_embeddings",
    dimensions = 1536,
    metric = "cosine" // Similarity metric: cosine, dot_product, or euclidean
)
vectorDB.createIndex(indexOptions)

// Store embeddings in the database
vectorDB.upsert(
    indexName = "document_embeddings",
    chunks = embeddedChunks // List of EmbeddedChunk objects
)

Using PostgreSQL with pgvector

PostgreSQL with the pgvector extension is a good solution for teams already using PostgreSQL who want to minimize infrastructure complexity. It provides solid performance for small to medium-sized collections and integrates well with existing PostgreSQL workflows.

Key features:

  • Supports multiple distance metrics (cosine, dot product, Euclidean)
  • Efficient indexing with HNSW and IVFFlat algorithms
  • Familiar SQL interface for complex queries
  • Transactional guarantees

For detailed setup instructions and best practices, see the official pgvector repository.
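
If you need to tune the index beyond what the adapter exposes, pgvector's index types can be managed directly over SQL. The sketch below uses plain JDBC and standard pgvector syntax, but the table and column names (document_embeddings, embedding) are assumptions about how the adapter lays out its data; adjust them to your schema.

PgVectorTuning.kt
import java.sql.DriverManager

// A minimal sketch of tuning the underlying pgvector index directly over JDBC.
// Table and column names are assumptions; adjust them to your schema.
fun createHnswIndex(connectionString: String) {
    DriverManager.getConnection(connectionString).use { conn ->
        conn.createStatement().use { stmt ->
            // HNSW index with pgvector's cosine operator class.
            // m and ef_construction trade index build time for recall.
            stmt.execute(
                """
                CREATE INDEX IF NOT EXISTS document_embeddings_hnsw
                ON document_embeddings
                USING hnsw (embedding vector_cosine_ops)
                WITH (m = 16, ef_construction = 64)
                """.trimIndent()
            )
        }
    }
}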

Vector Database Operations

Once you’ve set up your vector database, you can perform various operations on your embedded documents.

Creating Indexes

Before storing embeddings, you need to create an index with the appropriate dimension size for your embedding model:

CreateIndex.kt
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.vectordb.IndexOptions

// Create an index with the appropriate dimension
val indexOptions = IndexOptions(
    indexName = "document_embeddings",
    dimensions = 1536, // Must match your embedding model's output dimension
    metric = "cosine", // Similarity metric: cosine, dot_product, or euclidean
    // Optional provider-specific settings
    shards = 1,        // Number of shards (for distributed databases)
    replicas = 1       // Number of replicas (for high availability)
)

// Create the index
vectorDB.createIndex(indexOptions)

The dimension size must match the output dimension of your chosen embedding model. Common dimension sizes are:

  • OpenAI text-embedding-3-small: 1536 dimensions (or custom, e.g., 256)
  • DeepSeek embed-base: 1024 dimensions
  • Cohere embed-multilingual-v3: 1024 dimensions
  • Google text-embedding-004: 768 dimensions (or custom)

Important: Index dimensions cannot be changed after creation. To use a different model, you must delete and recreate the index with the new dimension size.
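
If you do switch models, drop and recreate the index at the new dimension before re-embedding your corpus. A minimal sketch follows; note that deleteIndex is a hypothetical method name used for illustration, so check your adapter for the actual index-removal call.

RecreateIndex.kt
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.vectordb.IndexOptions

// Migrate from a 1536-dimension model to a 768-dimension model.
// NOTE: deleteIndex is a hypothetical method name used for illustration.
fun migrateTo768Dimensions(vectorDB: VectorDBAdapter) {
    vectorDB.deleteIndex("document_embeddings") // drop the old 1536-dim index
    vectorDB.createIndex(
        IndexOptions(
            indexName = "document_embeddings",
            dimensions = 768, // e.g., Google text-embedding-004
            metric = "cosine"
        )
    )
    // All documents must then be re-embedded with the new model and upserted again.
}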

Similarity Search

The most common operation is similarity search, which finds documents similar to a query embedding:

SimilaritySearch.kt
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.embedding.OpenAIEmbedder

// Create an embedder for query embedding
val embedder = OpenAIEmbedder()

// Generate embedding for the query
val query = "How does climate change affect agriculture?"
val queryEmbedding = embedder.embedText(query)

// Search for similar documents
val searchResults = vectorDB.search(
    indexName = "document_embeddings",
    embedding = queryEmbedding,
    limit = 5,     // Return top 5 results
    minScore = 0.7 // Only return results with similarity score >= 0.7
)

// Process the results
searchResults.forEach { result ->
    println("Score: ${result.score}")
    println("Content: ${result.chunk.content.take(100)}...")
    println("Metadata: ${result.chunk.metadata}")
    println()
}

Metadata Filtering

Most vector databases support filtering results based on metadata, allowing you to combine semantic search with traditional filtering:

FilteredSearch.kt
// Search with metadata filtering
val searchResults = vectorDB.search(
    indexName = "document_embeddings",
    embedding = queryEmbedding,
    filter = mapOf(
        "category" to "science", // Only return documents with category="science"
        "year" to mapOf(
            "\$gte" to 2020      // Only return documents with year >= 2020
        ),
        "authors" to mapOf(
            "\$in" to listOf("Smith", "Johnson") // Author must be in this list
        )
    ),
    limit = 5
)

Note: Filter syntax varies by database provider. Check the specific adapter documentation for details on the supported filter operations.

Hybrid Search

Some vector databases support hybrid search, combining vector similarity with keyword or full-text search:

HybridSearch.kt
// Perform hybrid search (vector + keyword)
val hybridResults = vectorDB.hybridSearch(
    indexName = "document_embeddings",
    embedding = queryEmbedding,
    text = "climate agriculture drought", // Keyword search
    vectorWeight = 0.7, // Weight for vector similarity (0.0-1.0)
    textWeight = 0.3,   // Weight for text similarity (0.0-1.0)
    limit = 5
)

CRUD Operations

Vector databases support standard CRUD (Create, Read, Update, Delete) operations:

CrudOperations.kt
// Create/Update: Upsert embeddings (already seen above)
val ids = vectorDB.upsert(
    indexName = "document_embeddings",
    chunks = embeddedChunks
)

// Read: Get specific documents by ID
val documents = vectorDB.get(
    indexName = "document_embeddings",
    ids = listOf("doc1", "doc2", "doc3")
)

// Delete: Remove documents by ID
vectorDB.delete(
    indexName = "document_embeddings",
    ids = listOf("doc1", "doc2")
)

// Delete by filter: Remove documents matching criteria
vectorDB.deleteByFilter(
    indexName = "document_embeddings",
    filter = mapOf("category" to "outdated")
)

Naming Rules for Databases

Each vector database enforces specific naming conventions for indexes and collections to ensure compatibility and prevent conflicts.

Index names must:

  • Start with a letter or underscore
  • Contain only letters, numbers, and underscores
  • Example: my_index_123 is valid
  • Example: my-index is not valid (contains hyphen)
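
If index names are generated dynamically (per tenant, per dataset), it can be worth validating them before calling createIndex. The helper below simply encodes the rules above; it is not part of the Kastrax API.

IndexNameValidation.kt
// Encodes the naming rules above: start with a letter or underscore,
// then letters, digits, or underscores only.
private val INDEX_NAME_RULE = Regex("^[A-Za-z_][A-Za-z0-9_]*$")

fun requireValidIndexName(name: String) {
    require(INDEX_NAME_RULE.matches(name)) {
        "Invalid index name '$name': must start with a letter or underscore " +
            "and contain only letters, numbers, and underscores"
    }
}

fun main() {
    requireValidIndexName("my_index_123") // valid
    requireValidIndexName("my-index")     // throws IllegalArgumentException (hyphen)
}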

Upserting Embeddings

After creating an index, you can store embeddings along with their basic metadata:

store-embeddings.ts
// Store embeddings with their corresponding metadata
await store.upsert({
  indexName: 'myCollection',  // index name
  vectors: embeddings,        // array of embedding vectors
  metadata: chunks.map(chunk => ({
    text: chunk.text,         // The original text content
    id: chunk.id              // Optional unique identifier
  }))
});

The upsert operation:

  • Takes an array of embedding vectors and their corresponding metadata
  • Updates existing vectors if they share the same ID
  • Creates new vectors if they don’t exist
  • Automatically handles batching for large datasets

For complete examples of upserting embeddings in different vector stores, see the Upsert Embeddings guide.
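
The same update-or-create semantics apply in the Kotlin API shown earlier: re-upserting a chunk under an existing ID overwrites the stored vector and metadata rather than creating a duplicate. The EmbeddedChunk constructor below is sketched from the fields used throughout this guide (id, content, metadata, embedding) and may differ from the actual class.

UpsertSemantics.kt
import ai.kastrax.rag.document.EmbeddedChunk

// Sketch only: this EmbeddedChunk constructor is an assumption based on the
// fields referenced in this guide; embedder and vectorDB come from the
// earlier examples.
val original = EmbeddedChunk(
    id = "doc1",
    content = "First draft of the climate report.",
    metadata = mapOf("version" to "1.0"),
    embedding = embedder.embedText("First draft of the climate report.")
)
vectorDB.upsert(indexName = "document_embeddings", chunks = listOf(original))

// Upserting again with the same ID updates the stored vector and metadata
// in place instead of creating a duplicate entry.
val revised = original.copy(
    content = "Final draft of the climate report.",
    metadata = mapOf("version" to "2.0"),
    embedding = embedder.embedText("Final draft of the climate report.")
)
vectorDB.upsert(indexName = "document_embeddings", chunks = listOf(revised))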

Adding Metadata

Vector stores support rich metadata (any JSON-serializable fields) for filtering and organization. Since metadata is stored with no fixed schema, use consistent field naming to avoid unexpected query results.

Important: Metadata is crucial for vector storage - without it, you’d only have numerical embeddings with no way to return the original text or filter results. Always store at least the source text as metadata.

// Store embeddings with rich metadata for better organization and filtering
await store.upsert({
  indexName: "myCollection",
  vectors: embeddings,
  metadata: chunks.map((chunk) => ({
    // Basic content
    text: chunk.text,
    id: chunk.id,

    // Document organization
    source: chunk.source,
    category: chunk.category,

    // Temporal metadata
    createdAt: new Date().toISOString(),
    version: "1.0",

    // Custom fields
    language: chunk.language,
    author: chunk.author,
    confidenceScore: chunk.score,
  })),
});

Key metadata considerations:

  • Be strict with field naming - inconsistencies like ‘category’ vs ‘Category’ will affect queries
  • Only include fields you plan to filter or sort by - extra fields add overhead
  • Add timestamps (e.g., ‘createdAt’, ‘lastUpdated’) to track content freshness

Best Practices

Performance Optimization

  1. Choose the right vector database for your specific needs:
    • For small to medium datasets with existing PostgreSQL: Use pgvector
    • For large-scale production: Consider Pinecone, Qdrant, or MongoDB Atlas
    • For edge deployment: Consider LibSQL or Chroma
    • For serverless: Consider Upstash or Cloudflare Vectorize
  2. Optimize index configuration:
    • Use appropriate indexing algorithms (HNSW, IVFFlat, etc.) based on your dataset size
    • Balance recall vs. performance with index parameters
    • Consider sharding for very large datasets
  3. Batch operations:
    • Use batch operations for large insertions (the upsert method handles batching automatically)
    • Process embeddings in batches to avoid memory issues
    • Consider async operations for non-blocking performance (see the sketch after this list)
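
As a sketch of the batching and async points above, the following helper upserts a large chunk list in fixed-size batches using coroutines. It assumes the adapter tolerates concurrent upsert calls; if Kastrax already batches internally, this is only useful for very large backfills.

BatchUpsert.kt
import ai.kastrax.rag.document.EmbeddedChunk
import ai.kastrax.rag.vectordb.VectorDBAdapter
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Split a large chunk list into fixed-size batches and upsert them
// concurrently on the IO dispatcher. Assumes concurrent upserts are safe.
suspend fun batchedUpsert(
    vectorDB: VectorDBAdapter,
    chunks: List<EmbeddedChunk>,
    batchSize: Int = 100
) = coroutineScope {
    chunks.chunked(batchSize)
        .map { batch ->
            async(Dispatchers.IO) {
                vectorDB.upsert(indexName = "document_embeddings", chunks = batch)
            }
        }
        .awaitAll()
}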

Data Management

  1. Metadata strategy:
    • Only store metadata you’ll query against
    • Be consistent with field naming (e.g., ‘category’ vs ‘Category’)
    • Add timestamps (e.g., ‘createdAt’, ‘lastUpdated’) to track content freshness
  2. Dimension management:
    • Match embedding dimensions to your model (e.g., 1536 for text-embedding-3-small)
    • Consider dimension reduction for large collections
    • Create indexes before bulk insertions
  3. Data lifecycle:
    • Implement TTL (Time-To-Live) for temporary data (see the cleanup sketch after this list)
    • Set up regular reindexing for optimal performance
    • Create backup strategies for critical data
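
For the TTL point, one lightweight approach is a periodic cleanup job built on the deleteByFilter operation shown earlier. The $lt operator below mirrors the $gte filter style used above, but like all filter syntax it is provider-dependent.

TtlCleanup.kt
import ai.kastrax.rag.vectordb.VectorDBAdapter
import java.time.Instant
import java.time.temporal.ChronoUnit

// Periodic cleanup: delete documents whose createdAt timestamp is older than
// the retention window. The "$lt" operator is provider-dependent; it mirrors
// the "$gte" style used in the filtering examples above.
fun cleanupExpired(vectorDB: VectorDBAdapter, retentionDays: Long = 30) {
    val cutoff = Instant.now().minus(retentionDays, ChronoUnit.DAYS).toString()
    vectorDB.deleteByFilter(
        indexName = "document_embeddings",
        filter = mapOf("createdAt" to mapOf("\$lt" to cutoff))
    )
}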

Integration with RAG Pipeline

Vector databases are a critical component of the RAG pipeline in Kastrax. Here’s how they integrate with other components:

CompleteRAGPipeline.kt
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.vectordb.PineconeAdapter
import ai.kastrax.rag.retrieval.RetrievalEngine
import ai.kastrax.rag.generation.LLMGenerator

// 1. Document Processing
val document = Document.fromText("Your document content...")
val chunker = RecursiveChunker()
val chunks = chunker.chunk(document)

// 2. Embedding Generation
val embedder = OpenAIEmbedder()
val embeddedChunks = embedder.embed(chunks)

// 3. Vector Database Storage
val vectorDB = PineconeAdapter()
vectorDB.upsert("document_embeddings", embeddedChunks)

// 4. Retrieval
val retrievalEngine = RetrievalEngine(vectorDB)
val query = "How does climate change affect agriculture?"
val retrievedChunks = retrievalEngine.retrieve(
    query = query,
    indexName = "document_embeddings",
    limit = 5
)

// 5. Generation
val generator = LLMGenerator()
val answer = generator.generate(
    query = query,
    context = retrievedChunks.joinToString("\n\n") { it.chunk.content },
    prompt = "Answer the question based on the provided context."
)

println("Query: $query")
println("Answer: $answer")

Conclusion

Vector databases are a fundamental component of any RAG system, enabling efficient storage and retrieval of vector embeddings. Kastrax provides a unified interface for working with various vector database providers, allowing you to choose the right solution for your specific needs while maintaining a consistent API.

Key takeaways:

  1. Choose the right database based on your scale, performance requirements, and existing infrastructure
  2. Configure indexes properly to match your embedding model dimensions and similarity metrics
  3. Use metadata effectively to enable filtering and hybrid search capabilities
  4. Implement best practices for performance optimization and data management
  5. Integrate seamlessly with the rest of your RAG pipeline

By following these guidelines, you can build a robust and efficient vector storage system that forms the backbone of your RAG applications.
