Adding Voice Capabilities to Kastrax AI Agents ✅
Kastrax AI Agents can be enhanced with sophisticated voice capabilities, enabling natural spoken interactions with users. The voice system supports text-to-speech (TTS) for generating spoken responses, speech-to-text (STT) for understanding user input, and real-time speech-to-speech for continuous conversations. This guide explains how to implement these capabilities in your Kastrax agents.
Voice Architecture in Kastrax ✅
Kastrax implements a sophisticated voice system with several key components:
- Voice Providers: Adapters for various speech services (DeepSeek, ElevenLabs, Google, etc.)
- Voice Interface: A unified API for text-to-speech and speech-to-text operations
- Realtime Voice: Support for continuous, bidirectional speech conversations
- Voice Events: An event system for monitoring and responding to voice interactions
- Voice Configuration: Extensive options for customizing voice characteristics
This architecture allows for flexible integration with different voice services while maintaining a consistent developer experience.
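To give a feel for the unified interface, here is a rough sketch of the shape such an abstraction typically takes. The `VoiceProvider` name and member signatures below are illustrative assumptions, not the exact Kastrax API:

```kotlin
import ai.kastrax.voice.config.SpeakOptions
import ai.kastrax.voice.config.ListenOptions
import java.io.InputStream

// Illustrative sketch only; the real ai.kastrax.voice types may differ.
interface VoiceProvider {
    // Text-to-speech: synthesize text into an audio stream
    suspend fun speak(text: String, options: SpeakOptions? = null): InputStream

    // Speech-to-text: transcribe an audio stream into text
    suspend fun listen(audioStream: InputStream, options: ListenOptions? = null): String
}
```

Because every provider sits behind the same two operations, swapping DeepSeek for ElevenLabs (or composing them, as shown later) requires no changes to agent code.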
Using a Single Provider ✅
The simplest way to add voice to an agent is to use a single provider for both speaking and listening:
```kotlin
import ai.kastrax.core.agent.Agent
import ai.kastrax.core.agent.AgentConfig
import ai.kastrax.core.llm.DeepSeekProvider
import ai.kastrax.voice.DeepSeekVoice
import ai.kastrax.voice.config.SpeakOptions
import ai.kastrax.voice.config.ListenOptions
import ai.kastrax.voice.AudioFormat
import java.io.File
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // Initialize the voice provider with default settings
    val voice = DeepSeekVoice(
        apiKey = "your-deepseek-api-key",
        defaultVoice = "female-1" // Default voice
    )

    // Create an agent with voice capabilities
    val agent = Agent(
        config = AgentConfig(
            name = "VoiceAssistant",
            description = "An assistant with voice interaction capabilities",
            instructions = "You are a helpful assistant with both speech recognition and speech synthesis capabilities.",
            llmProvider = DeepSeekProvider(apiKey = "your-deepseek-api-key"),
            voice = voice
        )
    )

    // The agent can now use voice for interaction
    val audioStream = agent.voice.speak(
        text = "Hello, I'm your AI assistant!",
        options = SpeakOptions(
            format = AudioFormat.MP3,
            voice = "male-2", // Override the default voice
            speed = 1.0f,     // Normal speaking speed
            pitch = 0.0f      // Normal pitch
        )
    )

    // Save the audio to a file
    val outputFile = File("assistant_greeting.mp3")
    audioStream.use { input ->
        outputFile.outputStream().use { output ->
            input.copyTo(output)
        }
    }
    println("Saved greeting to ${outputFile.absolutePath}")

    // Transcribe audio from a file
    try {
        val inputFile = File("user_input.mp3")
        val transcription = agent.voice.listen(
            audioStream = inputFile.inputStream(),
            options = ListenOptions(
                format = AudioFormat.MP3,
                language = "en-US", // Specify the language for better accuracy
                model = "standard"  // Use the standard recognition model
            )
        )
        println("Transcription: $transcription")
    } catch (e: Exception) {
        println("Error transcribing audio: ${e.message}")
    }
}
```
Using Multiple Providers ✅
For more flexibility, you can use different providers for speaking and listening via the CompositeVoice class, which lets you leverage the strengths of different voice services:
```kotlin
import ai.kastrax.core.agent.Agent
import ai.kastrax.core.agent.AgentConfig
import ai.kastrax.core.llm.DeepSeekProvider
import ai.kastrax.voice.CompositeVoice
import ai.kastrax.voice.DeepSeekVoice
import ai.kastrax.voice.ElevenLabsVoice
import ai.kastrax.voice.GoogleVoice
import ai.kastrax.voice.config.DeepSeekVoiceConfig

// Create an agent with multiple voice providers
val agent = Agent(
    config = AgentConfig(
        name = "MultiProviderVoiceAssistant",
        description = "An assistant using multiple voice services",
        instructions = "You are a helpful assistant with advanced voice capabilities.",
        llmProvider = DeepSeekProvider(apiKey = "your-deepseek-api-key"),
        // Create a composite voice using different providers for different functions
        voice = CompositeVoice(
            // Use DeepSeek for speech recognition (STT)
            input = DeepSeekVoice(
                apiKey = "your-deepseek-api-key",
                config = DeepSeekVoiceConfig(
                    recognitionModel = "whisper-large-v3"
                )
            ),
            // Use ElevenLabs for high-quality speech synthesis (TTS)
            output = ElevenLabsVoice(
                apiKey = "your-elevenlabs-api-key",
                defaultVoice = "Rachel",
                stability = 0.7f,
                similarityBoost = 0.5f
            )
        )
    )
)

// You can also create more complex combinations
val advancedAgent = Agent(
    config = AgentConfig(
        name = "AdvancedVoiceAssistant",
        description = "An assistant with specialized voice capabilities",
        instructions = "You are a helpful assistant with advanced voice capabilities.",
        llmProvider = DeepSeekProvider(apiKey = "your-deepseek-api-key"),
        // Create a composite voice with fallback providers
        voice = CompositeVoice(
            // Primary input provider with fallback
            input = CompositeVoice.InputWithFallback(
                primary = DeepSeekVoice(apiKey = "your-deepseek-api-key"),
                fallback = GoogleVoice(apiKey = "your-google-api-key")
            ),
            // Primary output provider with fallback
            output = CompositeVoice.OutputWithFallback(
                primary = ElevenLabsVoice(apiKey = "your-elevenlabs-api-key"),
                fallback = DeepSeekVoice(apiKey = "your-deepseek-api-key")
            )
        )
    )
)
```
This approach allows you to:
- Use specialized providers for different voice functions
- Implement fallback mechanisms for improved reliability
- Optimize for different quality, cost, or performance requirements
- Mix and match providers based on specific use cases
Working with Audio Streams ✅
The `speak()` and `listen()` methods in Kastrax work with Java's `InputStream` and `OutputStream` for handling audio data. Here's how to work with audio files and streams:
Saving Speech Output
The `speak` method returns an input stream that you can save to a file or send to speakers:
```kotlin
import ai.kastrax.voice.config.SpeakOptions
import ai.kastrax.voice.AudioFormat
import java.io.File
import kotlinx.coroutines.runBlocking

fun saveAgentSpeech() = runBlocking {
    // Generate speech with custom options
    val audioStream = agent.voice.speak(
        text = "Hello, world! This is a demonstration of Kastrax voice capabilities.",
        options = SpeakOptions(
            format = AudioFormat.MP3,
            voice = "female-1",
            speed = 1.1f,  // Slightly faster than normal
            pitch = -0.2f  // Slightly lower pitch
        )
    )

    // Save the audio to a file
    val outputFile = File("agent_speech.mp3")
    audioStream.use { input ->
        outputFile.outputStream().use { output ->
            input.copyTo(output)
        }
    }
    println("Speech saved to: ${outputFile.absolutePath}")
}
```
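To send the audio to speakers instead, you can decode the stream with the JDK's built-in `javax.sound.sampled` API. This is a minimal sketch assuming the provider returns uncompressed WAV/PCM audio (the JDK cannot decode MP3 without an additional codec):

```kotlin
import java.io.BufferedInputStream
import java.io.InputStream
import javax.sound.sampled.AudioSystem

fun playWavStream(audioStream: InputStream) {
    // AudioSystem needs mark/reset support, so wrap the raw stream
    val audioIn = AudioSystem.getAudioInputStream(BufferedInputStream(audioStream))
    val clip = AudioSystem.getClip()
    clip.open(audioIn)
    clip.start()
    clip.drain() // Block until playback has finished
    clip.close()
    audioIn.close()
}
```

You could feed it the result of `agent.voice.speak(text, SpeakOptions(format = AudioFormat.WAV))`.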
Processing Audio in Memory
You can also process audio data in memory for more advanced use cases:
```kotlin
import ai.kastrax.voice.config.SpeakOptions
import ai.kastrax.voice.AudioFormat
import java.io.ByteArrayOutputStream
import java.io.File
import kotlinx.coroutines.runBlocking

fun processAudioInMemory() = runBlocking {
    // Generate speech
    val audioStream = agent.voice.speak(
        text = "Processing audio in memory",
        options = SpeakOptions(format = AudioFormat.WAV)
    )

    // Read the entire audio stream into memory
    val audioData = ByteArrayOutputStream().use { output ->
        audioStream.use { input ->
            input.copyTo(output)
        }
        output.toByteArray()
    }

    // Now you can process the audio data in memory
    println("Generated ${audioData.size} bytes of audio data")

    // Example: apply audio processing (e.g., volume adjustment)
    val processedAudio = applyAudioProcessing(audioData)

    // Save the processed audio
    File("processed_audio.wav").writeBytes(processedAudio)
}

// Example audio processing function
fun applyAudioProcessing(audioData: ByteArray): ByteArray {
    // This is a simplified example; in a real application,
    // you would implement actual audio processing logic here
    return audioData // Return unmodified for this example
}
```
Transcribing Audio Input
The `listen` method accepts an input stream of audio data from a microphone or file:
```kotlin
import ai.kastrax.voice.config.ListenOptions
import ai.kastrax.voice.AudioFormat
import java.io.File
import kotlinx.coroutines.runBlocking

fun transcribeAudioFile() = runBlocking {
    // Read the audio file
    val audioFile = File("user_input.mp3")

    try {
        println("Transcribing audio file...")
        val transcription = agent.voice.listen(
            audioStream = audioFile.inputStream(),
            options = ListenOptions(
                format = AudioFormat.MP3,
                language = "en-US",
                model = "large", // Use the large model for better accuracy
                promptHint = "technical discussion" // Provide a context hint
            )
        )
        println("Transcription: $transcription")

        // Generate a response based on the transcription
        val response = agent.generate(
            input = transcription,
            sessionId = "voice-user-123",
            conversationId = "voice-session-456"
        )

        // Convert the response to speech and save it
        agent.voice.speak(response.text).use { input ->
            File("agent_response.mp3").outputStream().use { output ->
                input.copyTo(output)
            }
        }
    } catch (e: Exception) {
        println("Audio transcription error: ${e.message}")
    }
}
```
Real-time Voice Conversations ✅
Kastrax supports real-time, bidirectional voice conversations through its realtime voice providers. This enables natural, continuous speech interactions between users and agents:
```kotlin
import ai.kastrax.core.agent.Agent
import ai.kastrax.core.agent.AgentConfig
import ai.kastrax.core.llm.DeepSeekProvider
import ai.kastrax.voice.realtime.RealtimeVoice
import ai.kastrax.voice.realtime.RealtimeVoiceConfig
import ai.kastrax.voice.realtime.VoiceEvent
import ai.kastrax.tools.SearchTool
import ai.kastrax.tools.CalculatorTool
import ai.kastrax.audio.MicrophoneStream
import java.io.File
import java.io.InputStream
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.launch

fun main() = runBlocking {
    // Initialize the realtime voice provider
    val realtimeVoice = RealtimeVoice(
        config = RealtimeVoiceConfig(
            apiKey = "your-api-key",
            model = "deepseek-chat",
            voice = "female-1",
            responseSpeed = 0.9f, // Slightly faster response generation
            interruptible = true  // Allow interrupting the agent while speaking
        )
    )

    // Create an agent with real-time voice capabilities
    val agent = Agent(
        config = AgentConfig(
            name = "RealtimeVoiceAssistant",
            description = "An assistant with real-time voice conversation capabilities",
            instructions = "You are a helpful assistant capable of real-time voice conversations. Respond naturally and concisely.",
            llmProvider = DeepSeekProvider(apiKey = "your-deepseek-api-key"),
            tools = listOf(
                SearchTool(),
                CalculatorTool()
            ),
            voice = realtimeVoice
        )
    )

    // Set up event listeners
    setupVoiceEventListeners(agent)

    // Establish a connection to the voice service
    agent.voice.connect()

    // Start the conversation with a greeting
    agent.voice.speak("Hello, I'm your AI assistant. How can I help you today?")

    // Get the microphone input stream
    val microphoneStream = MicrophoneStream.create()

    // Send microphone audio to the agent
    launch {
        agent.voice.send(microphoneStream)
    }

    // Keep the application running until the user ends the conversation
    println("Conversation started. Press Enter to end the conversation.")
    readLine()

    // Clean up when done
    microphoneStream.close()
    agent.voice.close()
    println("Conversation ended.")
}

// Set up event listeners for the voice system
fun setupVoiceEventListeners(agent: Agent) {
    // Listen for speech audio data from the voice provider
    agent.voice.on(VoiceEvent.SPEAKING) { event ->
        val audio = event.audio
        // Process the audio data (e.g., play it through the speakers)
        playAudioThroughSpeakers(audio)
    }

    // Listen for transcribed text from both the voice provider and the user
    agent.voice.on(VoiceEvent.WRITING) { event ->
        println("${event.role}: ${event.text}")
        // You can also save the conversation to a transcript
        if (event.isFinal) {
            saveToTranscript(event.role, event.text)
        }
    }

    // Listen for thinking events (when the agent is processing)
    agent.voice.on(VoiceEvent.THINKING) { _ ->
        println("Agent is thinking...")
        // You could display a visual indicator here
    }

    // Listen for errors
    agent.voice.on(VoiceEvent.ERROR) { event ->
        println("Voice error: ${event.error}")
    }
}

// Example function to play audio through speakers
fun playAudioThroughSpeakers(audio: InputStream) {
    // Implementation depends on your audio playback library; the
    // javax.sound.sampled sketch earlier in this guide is one option.
    // This is a placeholder for actual audio playback code.
}

// Example function to save the conversation to a transcript
fun saveToTranscript(role: String, text: String) {
    File("conversation_transcript.txt").appendText("$role: $text\n")
}
```
Advanced Real-time Features
Kastrax’s real-time voice system supports several advanced features:
```kotlin
// Enable voice activity detection to automatically detect when the user stops speaking
val voiceWithVAD = RealtimeVoice(
    config = RealtimeVoiceConfig(
        apiKey = "your-api-key",
        model = "deepseek-chat",
        voice = "female-1",
        voiceActivityDetection = VoiceActivityDetectionConfig(
            enabled = true,
            silenceThreshold = 0.3f, // Level of silence that counts as end of speech
            silenceDuration = 1000   // Milliseconds of silence to trigger end of speech
        )
    )
)

// Enable streaming responses for faster agent replies
val streamingVoice = RealtimeVoice(
    config = RealtimeVoiceConfig(
        apiKey = "your-api-key",
        model = "deepseek-chat",
        voice = "female-1",
        streamingMode = StreamingMode.INCREMENTAL, // Start speaking before the full response is generated
        chunkSize = 20 // Words per chunk for incremental speaking
    )
)

// Enable voice commands for controlling the conversation
val voiceWithCommands = RealtimeVoice(
    config = RealtimeVoiceConfig(
        apiKey = "your-api-key",
        model = "deepseek-chat",
        voice = "female-1",
        voiceCommands = listOf(
            VoiceCommand("stop", "Stop the current response"),
            VoiceCommand("pause", "Pause the conversation"),
            VoiceCommand("resume", "Resume the conversation")
        )
    )
)
```
Supported Voice Providers ✅
Kastrax supports multiple voice providers for text-to-speech (TTS) and speech-to-text (STT) capabilities:
| Provider | Package | Features | Reference |
|---|---|---|---|
| DeepSeek | `ai.kastrax.voice.DeepSeekVoice` | TTS, STT | Documentation |
| DeepSeek Realtime | `ai.kastrax.voice.realtime.DeepSeekRealtimeVoice` | Realtime speech-to-speech | Documentation |
| ElevenLabs | `ai.kastrax.voice.ElevenLabsVoice` | High-quality TTS | Documentation |
| Google | `ai.kastrax.voice.GoogleVoice` | TTS, STT | Documentation |
| Azure | `ai.kastrax.voice.AzureVoice` | TTS, STT | Documentation |
| Whisper | `ai.kastrax.voice.WhisperVoice` | STT | Documentation |
For more details on voice capabilities, see the Voice API Reference.
Integration with Actor Model ✅
One of Kastrax’s unique features is the integration of the voice system with the actor model, enabling distributed voice processing:
```kotlin
import ai.kastrax.actor.ActorSystem
import ai.kastrax.actor.Props
import ai.kastrax.voice.VoiceActor
import ai.kastrax.voice.DeepSeekVoice
import ai.kastrax.voice.messages.*
import java.io.File
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // Create an actor system
    val system = ActorSystem("voice-system")

    // Create a voice actor
    val voiceActor = system.actorOf(
        Props.create(VoiceActor::class.java, DeepSeekVoice(apiKey = "your-api-key")),
        "voice-actor"
    )

    // Send a speech synthesis message
    val result = system.ask<VoiceResult>(
        voiceActor,
        SpeakMessage("Hello, I am a voice actor!")
    )

    // Process the result
    when (result) {
        is VoiceResult.Success -> {
            // Save the synthesized audio to a file
            result.audio.use { input ->
                File("actor_speech.mp3").outputStream().use { output ->
                    input.copyTo(output)
                }
            }
            println("Speech generated successfully")
        }
        is VoiceResult.Error -> {
            println("Voice processing error: ${result.message}")
        }
    }

    // Send a speech recognition message
    val audioFile = File("input.mp3")
    val transcriptionResult = system.ask<VoiceResult>(
        voiceActor,
        ListenMessage(audioFile.inputStream())
    )

    // Process the transcription result
    when (transcriptionResult) {
        is VoiceResult.Success -> {
            println("Transcription: ${transcriptionResult.text}")
        }
        is VoiceResult.Error -> {
            println("Transcription error: ${transcriptionResult.message}")
        }
    }

    // Shut down the actor system when done
    system.terminate()
}
```
This integration enables building sophisticated multi-agent systems where voice processing can be distributed across different nodes and executed concurrently.
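To make the concurrency point concrete, here is a hedged sketch that fans synthesis requests out over a small pool of voice actors. It reuses the `ActorSystem` API from the example above and assumes `system.ask` is a suspending call, so requests can be issued in parallel with `async`:

```kotlin
import ai.kastrax.actor.ActorSystem
import ai.kastrax.actor.Props
import ai.kastrax.voice.VoiceActor
import ai.kastrax.voice.DeepSeekVoice
import ai.kastrax.voice.messages.*
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val system = ActorSystem("voice-farm")

    // Spawn a small pool of voice actors that can work in parallel
    val actors = (1..3).map { i ->
        system.actorOf(
            Props.create(VoiceActor::class.java, DeepSeekVoice(apiKey = "your-api-key")),
            "voice-actor-$i"
        )
    }

    val sentences = listOf("First sentence.", "Second sentence.", "Third sentence.")

    // Fan the synthesis requests out concurrently and wait for all results
    val results = sentences.mapIndexed { i, text ->
        async { system.ask<VoiceResult>(actors[i % actors.size], SpeakMessage(text)) }
    }.awaitAll()

    results.forEachIndexed { i, result ->
        when (result) {
            is VoiceResult.Success -> println("Segment $i synthesized")
            is VoiceResult.Error -> println("Segment $i failed: ${result.message}")
        }
    }

    system.terminate()
}
```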
Best Practices ✅
When implementing voice capabilities in your Kastrax agents, consider these best practices:
- Choose the Right Provider: Select voice providers based on your specific requirements for quality, latency, and language support.
- Handle Errors Gracefully: Implement robust error handling for network issues, service unavailability, and audio processing failures.
- Optimize Audio Settings: Configure audio format, quality, and compression based on your bandwidth and storage constraints.
- Consider Privacy: Be transparent about audio recording and processing, and implement appropriate data retention policies.
- Test with Real Users: Voice interfaces require extensive testing with diverse accents, background noise conditions, and use cases.
- Provide Visual Feedback: When using voice in applications with visual interfaces, provide feedback about listening and speaking states.
- Implement Fallbacks: Always provide text-based alternatives for situations where voice interaction isn't possible or fails (see the sketch after this list).
- Monitor Performance: Track metrics like speech recognition accuracy, response times, and user satisfaction to continuously improve your voice interface.
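As a minimal illustration of the error-handling and fallback practices above, the sketch below wraps a speak call so the agent degrades to text output when synthesis fails. The `respondWithFallback` helper is hypothetical, and it assumes an `agent` configured as in the earlier examples:

```kotlin
import ai.kastrax.core.agent.Agent
import java.io.InputStream

// Hypothetical helper: try voice output first, fall back to plain text.
suspend fun respondWithFallback(agent: Agent, text: String): InputStream? =
    try {
        agent.voice.speak(text)
    } catch (e: Exception) {
        // Synthesis failed (network issue, service outage, etc.);
        // surface the response as text instead of failing the interaction
        println("Voice unavailable (${e.message}); falling back to text:")
        println(text)
        null
    }
```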
By following these guidelines, you can create Kastrax AI Agents with robust voice capabilities that provide natural and effective user interactions.