
Document Chunking and Embedding in Kastrax

Effective document processing is the foundation of any successful RAG system. Kastrax provides powerful tools for transforming raw documents into optimized chunks and high-quality embeddings that enable accurate retrieval.

Document Processing Overview

The document processing pipeline in Kastrax consists of two main steps:

  1. Chunking: Splitting documents into semantically meaningful segments
  2. Embedding: Converting text chunks into vector representations

This process transforms raw documents into a format that can be efficiently stored in vector databases and retrieved based on semantic similarity.

Creating Documents

Before processing, you need to create a Document instance from your content. Kastrax supports multiple document formats:

DocumentCreation.kt
```kotlin
import ai.kastrax.rag.document.Document

// Create documents from different sources
fun createDocuments() {
    // From plain text
    val textDocument = Document.fromText(
        content = "Your plain text content...",
        metadata = mapOf("source" to "text", "author" to "John Doe")
    )

    // From HTML
    val htmlDocument = Document.fromHtml(
        content = "<html><body>Your HTML content...</body></html>",
        metadata = mapOf("source" to "web", "url" to "https://example.com")
    )

    // From Markdown
    val markdownDocument = Document.fromMarkdown(
        content = "# Your Markdown content...",
        metadata = mapOf("source" to "github", "repository" to "kastrax/docs")
    )

    // From JSON
    val jsonDocument = Document.fromJson(
        content = "{ \"key\": \"value\" }",
        metadata = mapOf("source" to "api", "endpoint" to "/data")
    )

    // From PDF (using the PDF extension)
    val pdfDocument = Document.fromPdf(
        filePath = "path/to/document.pdf",
        metadata = mapOf("source" to "file", "author" to "Jane Smith")
    )

    // From file (auto-detects format based on extension)
    val fileDocument = Document.fromFile(
        filePath = "path/to/document.docx",
        metadata = mapOf("source" to "file", "department" to "HR")
    )
}
```

Each document can include optional metadata that provides context about the document source, author, creation date, or any other relevant information. This metadata is preserved throughout the processing pipeline and can be used for filtering during retrieval.
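Because metadata travels with each chunk, it can be used to narrow a search at query time. Here is a minimal sketch, assuming a vector database adapter and embedder configured as in the examples later in this guide; the query text and the `department` value are hypothetical:

```kotlin
// Restrict retrieval to chunks whose source document was tagged with
// "department" to "HR" at creation time (see fileDocument above).
// Assumes vectorDB and embedder are set up as shown later in this guide.
val hrResults = vectorDB.search(
    embedding = embedder.embedText("What is the vacation policy?"),
    limit = 5,
    filter = mapOf("department" to "HR")
)
```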

Document Chunking

Chunking is the process of splitting documents into smaller, semantically meaningful segments. Kastrax provides multiple chunking strategies optimized for different document types and use cases:

| Strategy | Description | Best For |
|---|---|---|
| RecursiveChunker | Smart splitting based on content structure | General purpose, preserves semantic units |
| CharacterChunker | Simple character-based splits | Simple text with uniform structure |
| TokenChunker | Token-aware splitting | LLM-optimized chunks with precise token counts |
| MarkdownChunker | Markdown-aware splitting | Markdown documents, preserves headings and structure |
| HtmlChunker | HTML structure-aware splitting | Web pages, preserves HTML elements |
| JsonChunker | JSON structure-aware splitting | JSON data, preserves object boundaries |
| LatexChunker | LaTeX structure-aware splitting | Academic papers, preserves sections and formulas |
| SentenceChunker | Splits by sentences | Natural language text |
| ParagraphChunker | Splits by paragraphs | Articles and essays |

Basic Chunking Example

BasicChunking.kt
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.chunking.ChunkingOptions

// Create a document
val document = Document.fromText(
    content = """
        Climate change poses significant challenges to global agriculture.
        Rising temperatures and changing precipitation patterns affect crop yields.
        Farmers must adapt to these changing conditions to ensure food security.
        New technologies and farming practices can help mitigate these effects.
    """.trimIndent(),
    metadata = mapOf("topic" to "climate change", "domain" to "agriculture")
)

// Create a chunker with specific options
val chunker = RecursiveChunker(
    options = ChunkingOptions(
        chunkSize = 100,        // Target size in tokens
        chunkOverlap = 20,      // Overlap between chunks
        separator = "\n",       // Preferred split points
        keepSeparator = false,  // Whether to include separators in chunks
        extractMetadata = true  // Whether to extract additional metadata
    )
)

// Chunk the document
val chunks = chunker.chunk(document)

// Process the chunks
chunks.forEach { chunk ->
    println("Chunk: ${chunk.content.take(50)}...")
    println("Size: ${chunk.tokenCount} tokens")
    println("Metadata: ${chunk.metadata}")
    println()
}
```

Advanced Chunking Options

Kastrax provides fine-grained control over the chunking process:

AdvancedChunking.kt
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.ChunkerFactory
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.chunking.ChunkingStrategy
import ai.kastrax.rag.chunking.SplitBehavior
import ai.kastrax.rag.chunking.metadata.TitleExtractor
import ai.kastrax.rag.chunking.metadata.KeywordsExtractor
import ai.kastrax.rag.chunking.metadata.SummaryExtractor

// Create a document
val document = Document.fromMarkdown(
    content = """
        # Climate Change and Agriculture

        ## Introduction
        Climate change poses significant challenges to global agriculture.

        ## Effects on Crop Yields
        Rising temperatures and changing precipitation patterns affect crop yields.

        ## Adaptation Strategies
        Farmers must adapt to these changing conditions to ensure food security.

        ## Technological Solutions
        New technologies and farming practices can help mitigate these effects.
    """.trimIndent()
)

// Create a chunker using the factory with advanced options
val chunker = ChunkerFactory.create(
    strategy = ChunkingStrategy.MARKDOWN,
    options = ChunkingOptions(
        chunkSize = 150,
        chunkOverlap = 30,
        keepSeparator = true,
        extractMetadata = true,
        metadataExtractors = listOf(
            // Custom metadata extractors
            TitleExtractor(),
            KeywordsExtractor(),
            SummaryExtractor()
        ),
        customSplitters = mapOf(
            // Custom splitting rules
            "##" to SplitBehavior.ALWAYS_SPLIT,
            "*" to SplitBehavior.NEVER_SPLIT
        )
    )
)

// Chunk the document
val chunks = chunker.chunk(document)
```

Metadata Extraction

Kastrax can automatically extract metadata from chunks to enhance retrieval:

MetadataExtraction.kt
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.ChunkerFactory
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.chunking.ChunkingStrategy
import ai.kastrax.rag.chunking.metadata.LlmMetadataExtractor

// Create a document (longArticleText is a placeholder for your own content)
val document = Document.fromText(longArticleText)

// Create a chunker with LLM-based metadata extraction
val chunker = ChunkerFactory.create(
    strategy = ChunkingStrategy.RECURSIVE,
    options = ChunkingOptions(
        chunkSize = 200,
        extractMetadata = true,
        metadataExtractors = listOf(
            LlmMetadataExtractor(
                fields = listOf("title", "summary", "keywords", "entities"),
                llmProvider = "openai",
                modelName = "gpt-4"
            )
        )
    )
)

// Chunk the document with metadata extraction
val chunks = chunker.chunk(document)

// Access the extracted metadata
chunks.forEach { chunk ->
    val title = chunk.metadata["title"] as? String
    val summary = chunk.metadata["summary"] as? String
    val keywords = chunk.metadata["keywords"] as? List<String>

    println("Title: $title")
    println("Summary: $summary")
    println("Keywords: $keywords")
    println()
}
```

Note: LLM-based metadata extraction requires an API key for the selected provider.

Document Embedding

After chunking, the next step is to convert text chunks into vector embeddings. These embeddings capture the semantic meaning of the text and enable similarity-based retrieval. Kastrax provides a flexible embedding system that supports multiple providers and models.

Embedding Providers

Kastrax supports several embedding providers out of the box:

| Provider | Models | Features |
|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large | High quality, dimension control |
| DeepSeek | deepseek-embed-base, deepseek-embed-large | Multilingual support |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | Strong multilingual capabilities |
| HuggingFace | Various open-source models | Self-hosted options |
| Vertex AI | textembedding-gecko, textembedding-gecko-multilingual | Enterprise-grade |
| Local | MiniLM, BGE, E5 | No API dependency |

Basic Embedding Example

BasicEmbedding.kt
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a document and chunk it
val document = Document.fromText("Climate change poses significant challenges...")
val chunker = RecursiveChunker()
val chunks = chunker.chunk(document)

// Create an embedder
val embedder = OpenAIEmbedder(
    options = EmbeddingOptions(
        modelName = "text-embedding-3-small",
        dimensions = 1536, // Default dimension
        apiKey = System.getenv("OPENAI_API_KEY")
    )
)

// Generate embeddings for all chunks
val embeddedChunks = embedder.embed(chunks)

// Access the embeddings
embeddedChunks.forEach { chunk ->
    val embedding = chunk.embedding
    println("Chunk: ${chunk.content.take(30)}...")
    println("Embedding dimensions: ${embedding.size}")
    println("First 5 values: ${embedding.take(5)}")
    println()
}
```

Using DeepSeek Embeddings

DeepSeekEmbedding.kt
```kotlin
import ai.kastrax.rag.embedding.DeepSeekEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a DeepSeek embedder
val embedder = DeepSeekEmbedder(
    options = EmbeddingOptions(
        modelName = "deepseek-embed-large",
        apiKey = System.getenv("DEEPSEEK_API_KEY"),
        batchSize = 10 // Process 10 chunks at a time
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```

Using Local Embeddings

Kastrax supports local embedding models that run without API calls:

LocalEmbedding.kt
```kotlin
import ai.kastrax.rag.embedding.LocalEmbedder
import ai.kastrax.rag.embedding.LocalEmbeddingOptions

// Create a local embedder
val embedder = LocalEmbedder(
    options = LocalEmbeddingOptions(
        modelName = "BAAI/bge-small-en-v1.5",
        cacheDir = "/path/to/model/cache",
        quantize = true // Use quantization for faster inference
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```

Configuring Embedding Dimensions

Some embedding models support dimension reduction, which can help:

  • Decrease storage requirements in vector databases
  • Reduce computational costs for similarity searches
  • Improve retrieval performance in some cases

DimensionControl.kt
```kotlin
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create an embedder with reduced dimensions
val embedder = OpenAIEmbedder(
    options = EmbeddingOptions(
        modelName = "text-embedding-3-small",
        dimensions = 256, // Reduced from the default 1536
        apiKey = System.getenv("OPENAI_API_KEY")
    )
)

// Generate embeddings with reduced dimensions
val embeddedChunks = embedder.embed(chunks)
```

Batch Processing

For large document collections, Kastrax provides efficient batch processing:

BatchEmbedding.kt
```kotlin
import ai.kastrax.rag.embedding.EmbeddingBatcher
import ai.kastrax.rag.embedding.OpenAIEmbedder

// Create an embedder
val embedder = OpenAIEmbedder()

// Create a batcher for efficient processing
val batcher = EmbeddingBatcher(
    embedder = embedder,
    batchSize = 20, // Process 20 chunks per batch
    maxRetries = 3, // Retry failed batches up to 3 times
    concurrency = 2 // Process 2 batches concurrently
)

// Process a large collection of chunks
// (largeChunkCollection is a placeholder for your own pre-chunked documents)
val embeddedChunks = batcher.embedBatches(largeChunkCollection)
```

Vector Database Compatibility

When storing embeddings in a vector database, ensure that:

  1. The vector database index is configured with the same dimensions as your embeddings
  2. The similarity metric (cosine, dot product, euclidean) is consistent between embedding and retrieval
  3. The vector database supports the embedding format (typically float32 arrays)
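
As a concrete illustration of points 1 and 2, the sketch below computes cosine similarity between two embedding vectors; the function is purely illustrative and not part of the Kastrax API:

```kotlin
import kotlin.math.sqrt

// Illustrative cosine similarity between two embedding vectors.
// A score of 1.0 means identical direction; 0.0 means orthogonal.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embedding dimensions must match: ${a.size} vs ${b.size}" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```

A query embedded with one model, or compared under a different metric than the one the index was built with, will produce meaningless scores.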

Kastrax provides utilities to help with vector database integration:

VectorDBIntegration.kt
```kotlin
import ai.kastrax.rag.vectordb.PineconeAdapter

// Create a vector database adapter
val vectorDB = PineconeAdapter(
    indexName = "document-embeddings",
    dimensions = 1536,
    metric = "cosine",
    apiKey = System.getenv("PINECONE_API_KEY")
)

// Store embedded chunks
// (embedder and embeddedChunks continue from the embedding examples above)
val ids = vectorDB.upsert(embeddedChunks)

// Later, retrieve similar chunks
val query = "How does climate change affect agriculture?"
val queryEmbedding = embedder.embedText(query)
val similarChunks = vectorDB.search(
    embedding = queryEmbedding,
    limit = 5,
    minScore = 0.7
)
```

Complete RAG Pipeline

Here’s a complete example showing the entire document processing pipeline from raw text to vector database storage:

CompletePipeline.kt
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions
import ai.kastrax.rag.vectordb.PineconeAdapter
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // 1. Create a document from raw text
    val document = Document.fromText("""
        # Climate Change and Agriculture

        Climate change poses significant challenges to global agriculture.
        Rising temperatures and changing precipitation patterns affect crop yields.
        Farmers must adapt to these changing conditions to ensure food security.

        ## Effects on Crop Yields

        Studies have shown that for each degree Celsius of warming, there is a
        potential reduction in global yields of wheat by 6%, rice by 3.2%,
        maize by 7.4%, and soybean by 3.1%.

        ## Adaptation Strategies

        Farmers are implementing various strategies to adapt to climate change:
        - Drought-resistant crop varieties
        - Improved irrigation systems
        - Diversification of crops
        - Precision agriculture techniques

        ## Technological Solutions

        New technologies can help mitigate the effects of climate change on agriculture:
        - Weather forecasting systems
        - Satellite monitoring
        - AI-powered crop management
        - Greenhouse innovations
    """.trimIndent())

    // 2. Configure and create a chunker
    val chunker = RecursiveChunker(
        options = ChunkingOptions(
            chunkSize = 200,
            chunkOverlap = 50,
            separator = "\n\n",
            extractMetadata = true
        )
    )

    // 3. Chunk the document
    val chunks = chunker.chunk(document)
    println("Created ${chunks.size} chunks from the document")

    // 4. Configure and create an embedder
    val embedder = OpenAIEmbedder(
        options = EmbeddingOptions(
            modelName = "text-embedding-3-small",
            dimensions = 1536,
            apiKey = System.getenv("OPENAI_API_KEY")
        )
    )

    // 5. Generate embeddings for all chunks
    val embeddedChunks = embedder.embed(chunks)
    println("Generated embeddings for ${embeddedChunks.size} chunks")

    // 6. Configure and create a vector database adapter
    val vectorDB = PineconeAdapter(
        indexName = "climate-agriculture",
        dimensions = 1536,
        metric = "cosine",
        apiKey = System.getenv("PINECONE_API_KEY")
    )

    // 7. Store the embedded chunks in the vector database
    val ids = vectorDB.upsert(embeddedChunks)
    println("Stored ${ids.size} vectors in the database")

    // 8. Perform a test query
    val query = "How does climate change affect wheat production?"
    println("\nQuerying: $query")

    // 9. Generate embedding for the query
    val queryEmbedding = embedder.embedText(query)

    // 10. Search for similar chunks
    val searchResults = vectorDB.search(
        embedding = queryEmbedding,
        limit = 3,
        minScore = 0.7,
        filter = mapOf("domain" to "agriculture")
    )

    // 11. Display the results
    println("\nSearch Results:")
    searchResults.forEachIndexed { index, result ->
        println("\nResult ${index + 1} (Score: ${result.score})")
        println("Content: ${result.chunk.content}")
        println("Metadata: ${result.chunk.metadata}")
    }
}
```

Alternative Embedding Providers

Kastrax supports multiple embedding providers. Here’s how to use different providers in your pipeline:

Using DeepSeek

DeepSeekPipeline.kt
```kotlin
// Create a DeepSeek embedder
val embedder = DeepSeekEmbedder(
    options = EmbeddingOptions(
        modelName = "deepseek-embed-large",
        apiKey = System.getenv("DEEPSEEK_API_KEY")
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```

Using Cohere

CoherePipeline.kt
```kotlin
// Create a Cohere embedder
val embedder = CohereEmbedder(
    options = EmbeddingOptions(
        modelName = "embed-multilingual-v3.0",
        apiKey = System.getenv("COHERE_API_KEY")
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```

Using Local Models

LocalModelPipeline.kt
```kotlin
// Create a local embedder
val embedder = LocalEmbedder(
    options = LocalEmbeddingOptions(
        modelName = "BAAI/bge-small-en-v1.5",
        cacheDir = "/path/to/model/cache"
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```

Best Practices

Chunking Best Practices

  1. Choose the right chunking strategy for your document type (recursive for general text, markdown for markdown documents, etc.)
  2. Experiment with chunk size - smaller chunks (100-300 tokens) work better for precise retrieval, larger chunks (500-1000 tokens) provide more context
  3. Use appropriate overlap (typically 10-20% of chunk size) to prevent information loss at chunk boundaries
  4. Extract metadata to enhance retrieval with additional context
  5. Preserve document structure by using structure-aware chunkers for formatted documents
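
Taken together, these guidelines suggest a starting configuration for general prose like the sketch below; the specific numbers are illustrative starting points, not Kastrax defaults:

```kotlin
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.chunking.ChunkingOptions

// A starting point for general text: ~250-token chunks with ~16% overlap.
// Tune chunkSize and chunkOverlap against your own retrieval tests.
val proseChunker = RecursiveChunker(
    options = ChunkingOptions(
        chunkSize = 250,        // mid-range: precise retrieval while keeping context
        chunkOverlap = 40,      // about 16% of chunkSize, within the 10-20% guideline
        extractMetadata = true  // attach extra context for filtering during retrieval
    )
)
```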

Embedding Best Practices

  1. Choose the right embedding model based on your language requirements and quality needs
  2. Consider dimension reduction for large document collections to reduce storage and computation costs
  3. Use batch processing for large collections to optimize throughput
  4. Cache embeddings to avoid regenerating them for unchanged documents
  5. Monitor embedding quality by testing retrieval performance with representative queries
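
For the caching recommendation (point 4), one simple approach is to key stored embeddings by a hash of the chunk text, so unchanged chunks are never re-embedded. A minimal sketch; the `EmbeddingCache` class here is hypothetical, not part of the Kastrax API:

```kotlin
import java.security.MessageDigest

// Hypothetical in-memory cache keyed by a SHA-256 hash of the chunk text.
// Unchanged chunks hit the cache; only new or edited chunks call the embedder.
class EmbeddingCache {
    private val cache = mutableMapOf<String, FloatArray>()

    private fun keyFor(text: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(text.toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun getOrCompute(text: String, embed: (String) -> FloatArray): FloatArray =
        cache.getOrPut(keyFor(text)) { embed(text) }
}
```

A production version would persist the cache (for example, alongside the vector database) rather than keeping it in memory.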

For more information on document processing and embeddings in Kastrax, see the related RAG documentation pages.
