
voice.speak()

The speak() method is a core function available in all Kastrax voice providers that converts text to speech. It takes text input and returns an audio stream that can be played or saved.
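
The exact TypeScript signature isn't printed on this page; the sketch below is a rough reconstruction based only on the parameter and return value descriptions that follow, so the interface name and the index signature for provider-specific options are illustrative rather than taken from the library's type definitions.

```typescript
// Rough shape of speak(), inferred from this page; the real Kastrax type
// definitions may differ in naming and detail.
interface VoiceProviderLike {
  speak(
    input: string | NodeJS.ReadableStream,
    options?: {
      speaker?: string; // overrides the default speaker set in the constructor
      [key: string]: unknown; // provider-specific options (see below)
    },
  ): Promise<NodeJS.ReadableStream | void>;
}
```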

Usage Example

```typescript
import { OpenAIVoice } from "@kastrax/voice-openai";
import { Readable } from "stream";

// Initialize a voice provider
const voice = new OpenAIVoice({
  speaker: "alloy", // Default voice
});

// Basic usage with default settings
const audioStream = await voice.speak("Hello, world!");

// Using a different voice for this specific request
const audioStreamWithDifferentVoice = await voice.speak("Hello again!", {
  speaker: "nova",
});

// Using provider-specific options
const audioStreamWithOptions = await voice.speak("Hello with options!", {
  speaker: "echo",
  speed: 1.2, // OpenAI-specific option
});

// Using a text stream as input
const textStream = Readable.from(["Hello", " from", " a", " stream!"]);
const audioStreamFromTextStream = await voice.speak(textStream);
```

Parameters

  • input (string | NodeJS.ReadableStream): Text to convert to speech. Can be a string or a readable stream of text.
  • options? (object): Options for speech synthesis.
  • options.speaker? (string): Voice ID to use for this specific request. Overrides the default speaker set in the constructor.

Return Value

Returns a Promise<NodeJS.ReadableStream | void> where:

  • NodeJS.ReadableStream: A stream of audio data that can be played or saved
  • void: When using a realtime voice provider that emits audio through events instead of returning it directly
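
As a minimal sketch of handling that union, continuing from the usage example above (the .mp3 extension is illustrative; the actual format depends on the provider, as noted below):

```typescript
import { createWriteStream } from "fs";

const result = await voice.speak("Save me to disk!");

if (result) {
  // Standard providers return a readable audio stream; write it to a file.
  result.pipe(createWriteStream("output.mp3"));
} else {
  // Realtime providers return void and deliver audio through events instead
  // (see "Realtime Voice Providers" below).
}
```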

Provider-Specific Options

Each voice provider may support additional options specific to their implementation. Here are some examples:

OpenAI

  • options.speed? (number, default 1.0): Speech speed multiplier. Values between 0.25 and 4.0 are supported.

ElevenLabs

  • options.stability? (number, default 0.5): Voice stability. Higher values result in more stable, less expressive speech.
  • options.similarity_boost? (number, default 0.75): Voice clarity and similarity to the original voice.
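
A minimal sketch of passing these options, assuming a hypothetical @kastrax/voice-elevenlabs package that exports an ElevenLabsVoice class (the package and class names are not confirmed on this page, and the option values are illustrative):

```typescript
// Hypothetical import - package and class names are assumptions
import { ElevenLabsVoice } from "@kastrax/voice-elevenlabs";

const voice = new ElevenLabsVoice();

// Lower stability for a more expressive read, higher similarity_boost for
// closer resemblance to the reference voice (illustrative values)
const audioStream = await voice.speak("Hello from ElevenLabs!", {
  stability: 0.3,
  similarity_boost: 0.85,
});
```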

Google

  • options.languageCode? (string): Language code for the voice (e.g., 'en-US').
  • options.audioConfig? (object, default { audioEncoding: 'LINEAR16' }): Audio configuration options from the Google Cloud Text-to-Speech API.
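
A similar hedged sketch for Google, assuming a hypothetical @kastrax/voice-google package exporting a GoogleVoice class (names and values are illustrative):

```typescript
// Hypothetical import - package and class names are assumptions
import { GoogleVoice } from "@kastrax/voice-google";

const voice = new GoogleVoice();

// Request a US English voice and MP3-encoded output (illustrative values)
const audioStream = await voice.speak("Hello from Google!", {
  languageCode: "en-US",
  audioConfig: { audioEncoding: "MP3" },
});
```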

Murf

  • options.properties.rate? (number): Speech rate multiplier.
  • options.properties.pitch? (number): Voice pitch adjustment.
  • options.properties.format? ('MP3' | 'WAV' | 'FLAC' | 'ALAW' | 'ULAW'): Output audio format.
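
And a hedged sketch for Murf, assuming a hypothetical @kastrax/voice-murf package exporting a MurfVoice class (names and values are illustrative):

```typescript
// Hypothetical import - package and class names are assumptions
import { MurfVoice } from "@kastrax/voice-murf";

const voice = new MurfVoice();

// Slightly faster, slightly higher pitch, WAV output (illustrative values)
const audioStream = await voice.speak("Hello from Murf!", {
  properties: {
    rate: 1.1,
    pitch: 1.05,
    format: "WAV",
  },
});
```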

Realtime Voice Providers

When using realtime voice providers like OpenAIRealtimeVoice, the speak() method behaves differently:

  • Instead of returning an audio stream, it emits a ‘speaking’ event with the audio data
  • You need to register an event listener to receive the audio chunks
```typescript
import { OpenAIRealtimeVoice } from "@kastrax/voice-openai-realtime";
import Speaker from "@kastrax/node-speaker";

const speaker = new Speaker({
  sampleRate: 24100, // Audio sample rate in Hz - standard for high-quality audio on MacBook Pro
  channels: 1, // Mono audio output (as opposed to stereo which would be 2)
  bitDepth: 16, // Bit depth for audio quality - CD quality standard (16-bit resolution)
});

const voice = new OpenAIRealtimeVoice();
await voice.connect();

// Register event listener for audio chunks
voice.on("speaker", (stream) => {
  // Handle audio chunk (e.g., play it or save it)
  stream.pipe(speaker);
});

// This will emit 'speaking' events instead of returning a stream
await voice.speak("Hello, this is realtime speech!");
```

Using with CompositeVoice

When using CompositeVoice, the speak() method delegates to the configured speaking provider:

```typescript
import { CompositeVoice } from "@kastrax/core/voice";
import { OpenAIVoice } from "@kastrax/voice-openai";
import { PlayAIVoice } from "@kastrax/voice-playai";

const voice = new CompositeVoice({
  speakProvider: new PlayAIVoice(),
  listenProvider: new OpenAIVoice(),
});

// This will use the PlayAIVoice provider
const audioStream = await voice.speak("Hello, world!");
```

Notes

  • The behavior of speak() may vary slightly between providers, but all implementations follow the same basic interface.
  • When using a realtime voice provider, the method might not return an audio stream directly but instead emit a ‘speaking’ event.
  • If a text stream is provided as input, the provider will typically convert it to a string before processing.
  • The audio format of the returned stream depends on the provider. Common formats include MP3, WAV, and OGG.
  • For best performance, consider closing or ending the audio stream when you’re done with it.
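
As an illustration of the last note, one way to end the returned stream deterministically is to run it through stream.pipeline, which closes both ends once the copy completes or fails (this continues from the basic usage example above; the file name is illustrative):

```typescript
import { createWriteStream } from "fs";
import { pipeline } from "stream/promises";

const audio = await voice.speak("Clean up after yourself!");

if (audio) {
  // pipeline ends both streams and releases their resources when the copy
  // finishes (or if either side errors), so nothing is left dangling.
  await pipeline(audio, createWriteStream("speech.mp3"));
}
```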