# Document Chunking and Embedding in Kastrax
Effective document processing is the foundation of any successful RAG system. Kastrax provides powerful tools for transforming raw documents into optimized chunks and high-quality embeddings that enable accurate retrieval.
## Document Processing Overview
The document processing pipeline in Kastrax consists of two main steps:
- **Chunking**: Splitting documents into semantically meaningful segments
- **Embedding**: Converting text chunks into vector representations
This process transforms raw documents into a format that can be efficiently stored in vector databases and retrieved based on semantic similarity.
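In code, the two steps are two calls: chunk the document, then embed the chunks. Here is a minimal preview, a sketch that uses default settings for the classes introduced in detail throughout this page:

```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.embedding.OpenAIEmbedder

// The two-step pipeline at a glance (defaults assumed for chunker and embedder)
val document = Document.fromText("Your raw content...")
val chunks = RecursiveChunker().chunk(document)     // Step 1: chunking
val embeddedChunks = OpenAIEmbedder().embed(chunks) // Step 2: embedding
```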
## Creating Documents
Before processing, you need to create a `Document` instance from your content. Kastrax supports multiple document formats:
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.document.DocumentType

// Create documents from different sources
fun createDocuments() {
    // From plain text
    val textDocument = Document.fromText(
        content = "Your plain text content...",
        metadata = mapOf("source" to "text", "author" to "John Doe")
    )

    // From HTML
    val htmlDocument = Document.fromHtml(
        content = "<html><body>Your HTML content...</body></html>",
        metadata = mapOf("source" to "web", "url" to "https://example.com")
    )

    // From Markdown
    val markdownDocument = Document.fromMarkdown(
        content = "# Your Markdown content...",
        metadata = mapOf("source" to "github", "repository" to "kastrax/docs")
    )

    // From JSON
    val jsonDocument = Document.fromJson(
        content = "{ \"key\": \"value\" }",
        metadata = mapOf("source" to "api", "endpoint" to "/data")
    )

    // From PDF (using the PDF extension)
    val pdfDocument = Document.fromPdf(
        filePath = "path/to/document.pdf",
        metadata = mapOf("source" to "file", "author" to "Jane Smith")
    )

    // From file (auto-detects format based on extension)
    val fileDocument = Document.fromFile(
        filePath = "path/to/document.docx",
        metadata = mapOf("source" to "file", "department" to "HR")
    )
}
```
Each document can include optional metadata that provides context about the document source, author, creation date, or any other relevant information. This metadata is preserved throughout the processing pipeline and can be used for filtering during retrieval.
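For instance, metadata set at document creation travels with every chunk produced from it, which is what makes the metadata-filtered search shown later on this page possible. A small sketch using the APIs above:

```kotlin
val report = Document.fromText(
    content = "Quarterly summary...",
    metadata = mapOf("department" to "finance", "year" to "2024")
)
val reportChunks = RecursiveChunker().chunk(report)

// Document-level metadata is inherited by each chunk
check(reportChunks.all { it.metadata["department"] == "finance" })
```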
## Document Chunking
Chunking is the process of splitting documents into smaller, semantically meaningful segments. Kastrax provides multiple chunking strategies optimized for different document types and use cases:
| Strategy | Description | Best For |
|---|---|---|
| `RecursiveChunker` | Smart splitting based on content structure | General purpose, preserves semantic units |
| `CharacterChunker` | Simple character-based splits | Simple text with uniform structure |
| `TokenChunker` | Token-aware splitting | LLM-optimized chunks with precise token counts |
| `MarkdownChunker` | Markdown-aware splitting | Markdown documents, preserves headings and structure |
| `HtmlChunker` | HTML structure-aware splitting | Web pages, preserves HTML elements |
| `JsonChunker` | JSON structure-aware splitting | JSON data, preserves object boundaries |
| `LatexChunker` | LaTeX structure-aware splitting | Academic papers, preserves sections and formulas |
| `SentenceChunker` | Splits by sentences | Natural language text |
| `ParagraphChunker` | Splits by paragraphs | Articles and essays |
### Basic Chunking Example
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.chunking.ChunkingOptions

// Create a document
val document = Document.fromText(
    content = """
        Climate change poses significant challenges to global agriculture.
        Rising temperatures and changing precipitation patterns affect crop yields.
        Farmers must adapt to these changing conditions to ensure food security.
        New technologies and farming practices can help mitigate these effects.
    """.trimIndent(),
    metadata = mapOf("topic" to "climate change", "domain" to "agriculture")
)

// Create a chunker with specific options
val chunker = RecursiveChunker(
    options = ChunkingOptions(
        chunkSize = 100,        // Target size in tokens
        chunkOverlap = 20,      // Overlap between chunks
        separator = "\n",       // Preferred split points
        keepSeparator = false,  // Whether to include separators in chunks
        extractMetadata = true  // Whether to extract additional metadata
    )
)

// Chunk the document
val chunks = chunker.chunk(document)

// Process the chunks
chunks.forEach { chunk ->
    println("Chunk: ${chunk.content.take(50)}...")
    println("Size: ${chunk.tokenCount} tokens")
    println("Metadata: ${chunk.metadata}")
    println()
}
```
### Advanced Chunking Options
Kastrax provides fine-grained control over the chunking process:
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.ChunkerFactory
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.chunking.ChunkingStrategy
import ai.kastrax.rag.chunking.SplitBehavior
import ai.kastrax.rag.chunking.metadata.TitleExtractor
import ai.kastrax.rag.chunking.metadata.KeywordsExtractor
import ai.kastrax.rag.chunking.metadata.SummaryExtractor

// Create a document
val document = Document.fromMarkdown(
    content = """
        # Climate Change and Agriculture

        ## Introduction
        Climate change poses significant challenges to global agriculture.

        ## Effects on Crop Yields
        Rising temperatures and changing precipitation patterns affect crop yields.

        ## Adaptation Strategies
        Farmers must adapt to these changing conditions to ensure food security.

        ## Technological Solutions
        New technologies and farming practices can help mitigate these effects.
    """.trimIndent()
)

// Create a chunker using the factory with advanced options
val chunker = ChunkerFactory.create(
    strategy = ChunkingStrategy.MARKDOWN,
    options = ChunkingOptions(
        chunkSize = 150,
        chunkOverlap = 30,
        keepSeparator = true,
        extractMetadata = true,
        metadataExtractors = listOf(
            // Custom metadata extractors
            TitleExtractor(),
            KeywordsExtractor(),
            SummaryExtractor()
        ),
        customSplitters = mapOf(
            // Custom splitting rules
            "##" to SplitBehavior.ALWAYS_SPLIT,
            "*" to SplitBehavior.NEVER_SPLIT
        )
    )
)

// Chunk the document
val chunks = chunker.chunk(document)
```
### Metadata Extraction
Kastrax can automatically extract metadata from chunks to enhance retrieval:
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.ChunkerFactory
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.chunking.ChunkingStrategy
import ai.kastrax.rag.chunking.metadata.LlmMetadataExtractor

// Create a document (longArticleText holds the article body)
val document = Document.fromText(longArticleText)

// Create a chunker with LLM-based metadata extraction
val chunker = ChunkerFactory.create(
    strategy = ChunkingStrategy.RECURSIVE,
    options = ChunkingOptions(
        chunkSize = 200,
        extractMetadata = true,
        metadataExtractors = listOf(
            LlmMetadataExtractor(
                fields = listOf("title", "summary", "keywords", "entities"),
                llmProvider = "openai",
                modelName = "gpt-4"
            )
        )
    )
)

// Chunk the document with metadata extraction
val chunks = chunker.chunk(document)

// Access the extracted metadata
chunks.forEach { chunk ->
    val title = chunk.metadata["title"] as? String
    val summary = chunk.metadata["summary"] as? String
    val keywords = chunk.metadata["keywords"] as? List<String>
    println("Title: $title")
    println("Summary: $summary")
    println("Keywords: $keywords")
    println()
}
```
> **Note:** LLM-based metadata extraction requires an API key for the selected provider.
## Document Embedding
After chunking, the next step is to convert text chunks into vector embeddings. These embeddings capture the semantic meaning of the text and enable similarity-based retrieval. Kastrax provides a flexible embedding system that supports multiple providers and models.
### Embedding Providers
Kastrax supports several embedding providers out of the box:
| Provider | Models | Features |
|---|---|---|
| OpenAI | `text-embedding-3-small`, `text-embedding-3-large` | High quality, dimension control |
| DeepSeek | `deepseek-embed-base`, `deepseek-embed-large` | Multilingual support |
| Cohere | `embed-english-v3.0`, `embed-multilingual-v3.0` | Strong multilingual capabilities |
| HuggingFace | Various open-source models | Self-hosted options |
| Vertex AI | `textembedding-gecko`, `textembedding-gecko-multilingual` | Enterprise-grade |
| Local | MiniLM, BGE, E5 | No API dependency |
### Basic Embedding Example
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a document and chunk it
val document = Document.fromText("Climate change poses significant challenges...")
val chunker = RecursiveChunker()
val chunks = chunker.chunk(document)

// Create an embedder
val embedder = OpenAIEmbedder(
    options = EmbeddingOptions(
        modelName = "text-embedding-3-small",
        dimensions = 1536, // Default dimension for this model
        apiKey = System.getenv("OPENAI_API_KEY")
    )
)

// Generate embeddings for all chunks
val embeddedChunks = embedder.embed(chunks)

// Access the embeddings
embeddedChunks.forEach { chunk ->
    val embedding = chunk.embedding
    println("Chunk: ${chunk.content.take(30)}...")
    println("Embedding dimensions: ${embedding.size}")
    println("First 5 values: ${embedding.take(5)}")
    println()
}
```
### Using DeepSeek Embeddings
```kotlin
import ai.kastrax.rag.embedding.DeepSeekEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a DeepSeek embedder
val embedder = DeepSeekEmbedder(
    options = EmbeddingOptions(
        modelName = "deepseek-embed-large",
        apiKey = System.getenv("DEEPSEEK_API_KEY"),
        batchSize = 10 // Process 10 chunks at a time
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```
### Using Local Embeddings
Kastrax supports local embedding models that run without API calls:
```kotlin
import ai.kastrax.rag.embedding.LocalEmbedder
import ai.kastrax.rag.embedding.LocalEmbeddingOptions

// Create a local embedder
val embedder = LocalEmbedder(
    options = LocalEmbeddingOptions(
        modelName = "BAAI/bge-small-en-v1.5",
        cacheDir = "/path/to/model/cache",
        quantize = true // Use quantization for faster inference
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```
### Configuring Embedding Dimensions
Some embedding models support dimension reduction, which can help:
- Decrease storage requirements in vector databases
- Reduce computational costs for similarity searches
- Improve retrieval performance in some cases
```kotlin
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create an embedder with reduced dimensions
val embedder = OpenAIEmbedder(
    options = EmbeddingOptions(
        modelName = "text-embedding-3-small",
        dimensions = 256, // Reduced from the default 1536
        apiKey = System.getenv("OPENAI_API_KEY")
    )
)

// Generate embeddings with reduced dimensions
val embeddedChunks = embedder.embed(chunks)
```
### Batch Processing
For large document collections, Kastrax provides efficient batch processing:
```kotlin
import ai.kastrax.rag.embedding.EmbeddingBatcher
import ai.kastrax.rag.embedding.OpenAIEmbedder

// Create an embedder
val embedder = OpenAIEmbedder()

// Create a batcher for efficient processing
val batcher = EmbeddingBatcher(
    embedder = embedder,
    batchSize = 20,  // Process 20 chunks per batch
    maxRetries = 3,  // Retry failed batches up to 3 times
    concurrency = 2  // Process 2 batches concurrently
)

// Process a large collection of chunks
val embeddedChunks = batcher.embedBatches(largeChunkCollection)
```
### Vector Database Compatibility
When storing embeddings in a vector database, ensure that:
- The vector database index is configured with the same dimensions as your embeddings
- The similarity metric (cosine, dot product, Euclidean) is consistent between embedding and retrieval (see the sketch below)
- The vector database supports the embedding format (typically float32 arrays)
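For intuition about what that metric computes, here is cosine similarity written out in plain Kotlin (illustration only; in practice the vector database computes this on its side):

```kotlin
import kotlin.math.sqrt

// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]; values near 1 mean "very similar"
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embedding dimensions must match" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```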
Kastrax provides utilities to help with vector database integration:
```kotlin
import ai.kastrax.rag.vectordb.VectorDBAdapter
import ai.kastrax.rag.vectordb.PineconeAdapter

// Create a vector database adapter
val vectorDB = PineconeAdapter(
    indexName = "document-embeddings",
    dimensions = 1536,
    metric = "cosine",
    apiKey = System.getenv("PINECONE_API_KEY")
)

// Store embedded chunks
val ids = vectorDB.upsert(embeddedChunks)

// Later, retrieve similar chunks
val query = "How does climate change affect agriculture?"
val queryEmbedding = embedder.embedText(query)
val similarChunks = vectorDB.search(
    embedding = queryEmbedding,
    limit = 5,
    minScore = 0.7
)
```
## Complete RAG Pipeline
Here’s a complete example showing the entire document processing pipeline from raw text to vector database storage:
```kotlin
import ai.kastrax.rag.document.Document
import ai.kastrax.rag.chunking.RecursiveChunker
import ai.kastrax.rag.chunking.ChunkingOptions
import ai.kastrax.rag.embedding.OpenAIEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions
import ai.kastrax.rag.vectordb.PineconeAdapter
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // 1. Create a document from raw text, with metadata so the
    //    metadata filter in step 10 has something to match
    val document = Document.fromText(
        content = """
            # Climate Change and Agriculture

            Climate change poses significant challenges to global agriculture.
            Rising temperatures and changing precipitation patterns affect crop yields.
            Farmers must adapt to these changing conditions to ensure food security.

            ## Effects on Crop Yields
            Studies have shown that for each degree Celsius of warming, there is a
            potential reduction in global yields of wheat by 6%, rice by 3.2%,
            maize by 7.4%, and soybean by 3.1%.

            ## Adaptation Strategies
            Farmers are implementing various strategies to adapt to climate change:
            - Drought-resistant crop varieties
            - Improved irrigation systems
            - Diversification of crops
            - Precision agriculture techniques

            ## Technological Solutions
            New technologies can help mitigate the effects of climate change on agriculture:
            - Weather forecasting systems
            - Satellite monitoring
            - AI-powered crop management
            - Greenhouse innovations
        """.trimIndent(),
        metadata = mapOf("topic" to "climate change", "domain" to "agriculture")
    )

    // 2. Configure and create a chunker
    val chunker = RecursiveChunker(
        options = ChunkingOptions(
            chunkSize = 200,
            chunkOverlap = 50,
            separator = "\n\n",
            extractMetadata = true
        )
    )

    // 3. Chunk the document
    val chunks = chunker.chunk(document)
    println("Created ${chunks.size} chunks from the document")

    // 4. Configure and create an embedder
    val embedder = OpenAIEmbedder(
        options = EmbeddingOptions(
            modelName = "text-embedding-3-small",
            dimensions = 1536,
            apiKey = System.getenv("OPENAI_API_KEY")
        )
    )

    // 5. Generate embeddings for all chunks
    val embeddedChunks = embedder.embed(chunks)
    println("Generated embeddings for ${embeddedChunks.size} chunks")

    // 6. Configure and create a vector database adapter
    val vectorDB = PineconeAdapter(
        indexName = "climate-agriculture",
        dimensions = 1536,
        metric = "cosine",
        apiKey = System.getenv("PINECONE_API_KEY")
    )

    // 7. Store the embedded chunks in the vector database
    val ids = vectorDB.upsert(embeddedChunks)
    println("Stored ${ids.size} vectors in the database")

    // 8. Perform a test query
    val query = "How does climate change affect wheat production?"
    println("\nQuerying: $query")

    // 9. Generate embedding for the query
    val queryEmbedding = embedder.embedText(query)

    // 10. Search for similar chunks, filtering on the metadata set in step 1
    val searchResults = vectorDB.search(
        embedding = queryEmbedding,
        limit = 3,
        minScore = 0.7,
        filter = mapOf("domain" to "agriculture")
    )

    // 11. Display the results
    println("\nSearch Results:")
    searchResults.forEachIndexed { index, result ->
        println("\nResult ${index + 1} (Score: ${result.score})")
        println("Content: ${result.chunk.content}")
        println("Metadata: ${result.chunk.metadata}")
    }
}
```
## Alternative Embedding Providers
Kastrax supports multiple embedding providers. Here’s how to use different providers in your pipeline:
### Using DeepSeek
```kotlin
import ai.kastrax.rag.embedding.DeepSeekEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a DeepSeek embedder
val embedder = DeepSeekEmbedder(
    options = EmbeddingOptions(
        modelName = "deepseek-embed-large",
        apiKey = System.getenv("DEEPSEEK_API_KEY")
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```
### Using Cohere
```kotlin
import ai.kastrax.rag.embedding.CohereEmbedder
import ai.kastrax.rag.embedding.EmbeddingOptions

// Create a Cohere embedder
val embedder = CohereEmbedder(
    options = EmbeddingOptions(
        modelName = "embed-multilingual-v3.0",
        apiKey = System.getenv("COHERE_API_KEY")
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```
### Using Local Models
```kotlin
import ai.kastrax.rag.embedding.LocalEmbedder
import ai.kastrax.rag.embedding.LocalEmbeddingOptions

// Create a local embedder
val embedder = LocalEmbedder(
    options = LocalEmbeddingOptions(
        modelName = "BAAI/bge-small-en-v1.5",
        cacheDir = "/path/to/model/cache"
    )
)

// Generate embeddings
val embeddedChunks = embedder.embed(chunks)
```
## Best Practices
### Chunking Best Practices
- **Choose the right chunking strategy** for your document type (recursive for general text, markdown for Markdown documents, etc.)
- **Experiment with chunk size**: smaller chunks (100-300 tokens) work better for precise retrieval, while larger chunks (500-1000 tokens) provide more context; the sketch after this list shows one way to compare candidate sizes
- **Use appropriate overlap** (typically 10-20% of chunk size) to prevent information loss at chunk boundaries
- **Extract metadata** to enhance retrieval with additional context
- **Preserve document structure** by using structure-aware chunkers for formatted documents
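A quick way to act on the chunk-size advice is to chunk the same document at several sizes and compare the results side by side. A minimal sketch, assuming the `RecursiveChunker` and `ChunkingOptions` APIs shown earlier:

```kotlin
// Compare chunk counts and average sizes across candidate configurations
val candidateSizes = listOf(150, 300, 600)
for (size in candidateSizes) {
    val chunker = RecursiveChunker(
        options = ChunkingOptions(
            chunkSize = size,
            chunkOverlap = size / 5 // ~20% overlap
        )
    )
    val chunks = chunker.chunk(document)
    val avgTokens = chunks.map { it.tokenCount }.average().toInt()
    println("chunkSize=$size -> ${chunks.size} chunks, avg $avgTokens tokens")
}
```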
### Embedding Best Practices
- **Choose the right embedding model** based on your language requirements and quality needs
- **Consider dimension reduction** for large document collections to reduce storage and computation costs
- **Use batch processing** for large collections to optimize throughput
- **Cache embeddings** to avoid regenerating them for unchanged documents (a minimal caching sketch follows this list)
- **Monitor embedding quality** by testing retrieval performance with representative queries
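The caching advice can be as simple as a content-addressed map: key each chunk's text by a hash of its content and skip the embedding call when the key is already present. A minimal in-memory sketch (it assumes `embedText` returns a `FloatArray`; persist the map for real workloads):

```kotlin
import java.security.MessageDigest

// In-memory, content-addressed embedding cache (sketch only)
class EmbeddingCache(private val embedder: OpenAIEmbedder) {
    private val cache = mutableMapOf<String, FloatArray>()

    // SHA-256 of the text serves as the cache key: same text, same key
    private fun key(text: String): String =
        MessageDigest.getInstance("SHA-256").digest(text.toByteArray())
            .joinToString("") { "%02x".format(it) }

    // Returns the cached vector for unchanged text, embeds otherwise
    fun embed(text: String): FloatArray =
        cache.getOrPut(key(text)) { embedder.embedText(text) }
}
```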
## Related Resources
For more information on document processing and embeddings in Kastrax, see:
- Vector Databases - Learn how to store and retrieve embeddings
- RAG Retrieval - Advanced retrieval techniques for RAG systems
- RAG Evaluation - Methods to evaluate and optimize your RAG pipeline
- Chunking Strategies - Detailed reference for all chunking strategies
- Embedding Models - Comprehensive guide to embedding models and providers