Speech Engines

Modern speech recognition and synthesis models are powerful but heavy: large to download, slow to load, and resource-intensive to run. Every application that wants voice interaction has to implement its own model loading, optimization, and lifecycle management. Loading a model for a few voice commands and then discarding it is wasteful.

Dictare solves this by keeping speech-to-text (STT) and text-to-speech (TTS) engines loaded in memory as a background service. The models are loaded once at startup, optimized for your hardware, and stay ready. Any application can use them instantly through the OpenVIP protocol — no model loading, no ML dependencies, no GPU management. Just speak and listen.

All engines run 100% locally. No audio ever leaves your machine.

Speech-to-Text (STT)

Engine Selection

Dictare automatically selects the best STT engine for your platform:

Model config              Engine                Runtime        Platform
tiny to large-v3-turbo    MLXWhisperEngine      MLX            macOS Apple Silicon
tiny to large-v3-turbo    FasterWhisperEngine   CTranslate2    Linux / Intel Mac
parakeet-v3               ParakeetEngine        ONNX Runtime   Any
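The dispatch in the table above can be sketched as a small function. This is an illustrative sketch only — the function name and the exact checks are assumptions, not Dictare's actual code:

```python
def select_stt_engine(model: str, system: str, machine: str) -> str:
    """Illustrative dispatch mirroring the engine-selection table.

    `system` and `machine` stand in for platform.system() and
    platform.machine(); the rules here are an assumption.
    """
    if model == "parakeet-v3":
        return "ParakeetEngine"        # ONNX Runtime runs anywhere
    if system == "Darwin" and machine == "arm64":
        return "MLXWhisperEngine"      # MLX on Apple Silicon
    return "FasterWhisperEngine"       # CTranslate2 on Linux / Intel Mac
```

For example, a Whisper model on an Apple Silicon Mac resolves to the MLX backend, while the same model on Linux resolves to CTranslate2.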

Models

Model            Size      Speed      Accuracy   Best for
tiny             ~75 MB    Fastest    Lower      Testing, low-resource
base             ~140 MB   Fast       Moderate   Quick dictation
small            ~460 MB   Moderate   Good       General use
medium           ~1.5 GB   Slower     High       Accuracy-focused
large-v3         ~3 GB     Slowest    Highest    Best accuracy
large-v3-turbo   ~1.6 GB   Fast       High       Recommended
parakeet-v3      ~600 MB   Fast       High       Cross-platform

Configuration

[stt]
model = "large-v3-turbo"    # Recommended default
language = "auto"            # Auto-detect, or set "en", "it", etc.
translate = false            # If true, translate speech to English
hw_accel = true              # Use GPU/NPU acceleration

[stt.advanced]
device = "auto"              # auto, cpu, cuda, mlx
compute_type = "float16"     # int8 (faster), float16 (balanced), float32 (precise)
hotwords = ""                # Help recognition: "Dictare, OpenVIP, Claude"

Model Management

dictare models               # List available models
dictare models download      # Download a model
dictare models delete        # Remove a cached model

Hotwords

Improve recognition of technical terms and names:

[stt.advanced]
hotwords = "Dictare, OpenVIP, Claude Code, Codex, pytest"

Text-to-Speech (TTS)

Available Engines

Engine    Platform   Quality   Speed      Notes
espeak    Any        Basic     Instant    Built-in, no download
say       macOS      Good      Instant    Uses macOS system voices
piper     Any        Good      Fast       ONNX-based, many voices
kokoro    Any        High      Moderate   Neural, natural-sounding
outetts   Any        High      Slower     Neural TTS

Configuration

Your default engine is set in config.toml:

[tts]
engine = "say"       # Your preferred engine (always loaded, fastest)
language = "en"      # Voice accent
voice = ""           # Engine-specific speaker name

Using Multiple Engines

You can use any installed engine on the fly — even if it's not your default. Pass --engine to dictare speak:

dictare speak "Build complete"                         # Uses default engine
dictare speak "Test passed" --engine kokoro             # Uses kokoro for this request
dictare speak "Deploying" --engine piper                # Uses piper for this request

The first time you use a non-default engine, it takes a moment to load. After that, the audio is cached — same text + engine + language + voice = instant playback from cache.
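The caching rule can be pictured as a deterministic key over the four request parameters. The hashing scheme below is a sketch of the idea, not Dictare's actual key derivation:

```python
import hashlib

def tts_cache_key(text: str, engine: str, language: str, voice: str) -> str:
    """Sketch of a deterministic TTS cache key: identical
    (text, engine, language, voice) tuples map to the same entry.
    The exact scheme is an assumption, not Dictare's implementation."""
    # NUL-join the fields so "ab"+"c" can't collide with "a"+"bc".
    payload = "\x00".join([text, engine, language, voice])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any one field — say, switching the engine from kokoro to piper — yields a different key, so the cached audio is never reused across engines or voices.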

dictare speak --list-engines     # Show all available engines
dictare speak --list-voices      # Show voices for current engine

TTS Worker Isolation

Kokoro, Piper, and OuteTTS engines run in an isolated subprocess worker. This prevents their dependencies from interfering with the main Dictare process. The worker is managed automatically.
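The isolation pattern can be sketched as a parent process talking to a child interpreter over pipes. This is a minimal, self-contained illustration of the idea — the real worker protocol and its request format are Dictare internals, and the echo-style worker body here is a stand-in for an actual synthesis call:

```python
import json
import subprocess
import sys

# Stand-in worker body: a real worker would import the heavy TTS package
# here, keeping those dependencies out of the parent interpreter.
WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    # A real worker would synthesize audio at this point.
    print(json.dumps({"ok": True, "engine": req["engine"],
                      "chars": len(req["text"])}))
    sys.stdout.flush()
"""

def synthesize_in_worker(text: str, engine: str) -> dict:
    """Run one TTS request in an isolated child process."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    out, _ = proc.communicate(json.dumps({"text": text, "engine": engine}) + "\n")
    return json.loads(out)
```

Because the engine lives in its own process, a crash or dependency conflict in the worker cannot take down the main service; the parent can simply restart the child.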

Performance Tips

  • macOS Apple Silicon: Use large-v3-turbo with MLX for the best speed/accuracy balance
  • Linux with NVIDIA GPU: Use large-v3-turbo with CUDA (device = "cuda")
  • CPU-only machines: Use small or parakeet-v3 for reasonable speed
  • Low memory: Use tiny or base; set compute_type = "int8" to reduce memory usage
  • Hotwords: Always set hotwords for domain-specific terms you use frequently
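The memory-related tips above can be condensed into a rough heuristic. The thresholds below are illustrative assumptions, not values Dictare uses:

```python
def pick_compute_type(free_ram_gb: float) -> str:
    """Rough heuristic matching the tips above: int8 on tight machines,
    float16 as the balanced default, float32 when memory is plentiful.
    The RAM thresholds are illustrative, not Dictare's actual logic."""
    if free_ram_gb < 4:
        return "int8"      # roughly halves model memory vs float16
    if free_ram_gb < 16:
        return "float16"   # balanced default
    return "float32"       # most precise, largest footprint
```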